Abstract

In this paper we introduce the idea of improving the performance of parametric
temporal-difference (TD) learning algorithms by selectively emphasizing or de-emphasizing
their updates on different time steps. In particular, we show that varying the emphasis of linear
TD($\lambda$)’s updates in a particular way causes its expected update to become stable under
off-policy training. The only prior model-free TD methods to achieve this with per-step
computation linear in the number of function approximation parameters are the
gradient-TD family of methods including TDC, GTD($\lambda$), and GQ($\lambda$). Compared to these
methods, our emphatic TD($\lambda$) is simpler and easier to use; it has only one learned parameter
vector and one step-size parameter. Our treatment includes general state-dependent discounting
and bootstrapping functions, and a way of specifying varying degrees of interest in
accurately valuing different states.

Keywords: temporal-difference learning, off-policy training, function approximation,
convergence, stability


1. Parametric Temporal-Difference Learning 

Temporal-difference (TD) learning is perhaps the most important idea to come out of the 
field of reinforcement learning. The problem it solves is that of efficiently learning to make 
a sequence of long-term predictions about how a dynamical system will evolve over time. 
The key idea is to use the change (temporal difference) from one prediction to the next as 
an error in the earlier prediction. For example, if you are predicting on each day what the 
stock-market index will be at the end of the year, and events lead you one day to make a 
much lower prediction, then a TD method would infer that the predictions made prior to 
the drop were probably too high; it would adjust the parameters of its prediction function 
so as to make lower predictions for similar situations in the future. This approach contrasts 
with conventional approaches to prediction, which wait until the end of the year when the 
final stock-market index is known before adjusting any parameters, or else make only
short-term (e.g., one-day) predictions and then iterate them to produce a year-end prediction.
The TD approach is more convenient computationally because it requires less memory and
because its computations are spread out uniformly over the year (rather than being bunched
together all at the end of the year). A less obvious advantage of the TD approach is that
it often produces statistically more accurate answers than conventional approaches (Sutton
1988).

Parametric temporal-difference learning was first studied as the key “learning by
generalization” algorithm in Samuel’s (1959) checker player. Sutton (1988) introduced the
TD($\lambda$) algorithm and proved convergence in the mean of episodic linear TD(0), the simplest
parametric TD method. The potential power of parametric TD learning was convincingly
demonstrated by Tesauro (1992, 1995) when he applied TD($\lambda$) combined with neural
networks and self play to obtain ultimately the world’s best backgammon player. Dayan (1992)
proved convergence in expected value of episodic linear TD($\lambda$) for all $\lambda \in [0,1]$, and
Tsitsiklis and Van Roy (1997) proved convergence with probability one of discounted continuing
linear TD($\lambda$). Watkins (1989) extended TD learning to control in the form of Q-learning
and proved its convergence in the tabular case (without function approximation, Watkins
& Dayan 1992), while Rummery (1995) extended TD learning to control in an on-policy
form as the Sarsa($\lambda$) algorithm. Bradtke and Barto (1996), Boyan (1999), and Nedic and
Bertsekas (2003) extended linear TD learning to a least-squares form called LSTD($\lambda$).
Parametric TD methods have also been developed as models of animal learning (e.g., Sutton &
Barto 1990, Klopf 1988, Ludvig, Sutton & Kehoe 2012) and as models of the brain’s reward
systems (Schultz, Dayan & Montague 1997), where they have been particularly influential
(e.g., Niv & Schoenbaum 2008, O’Doherty 2012). Sutton (2009, 2012) has suggested that
parametric TD methods could be key not just to learning about reward, but to the learning
of world knowledge generally, and to perceptual learning. Extensive analysis of parametric
TD learning as stochastic approximation is provided by Bertsekas (2012, Chapter 6) and
Bertsekas and Tsitsiklis (1996).

Within reinforcement learning, TD learning is typically used to learn approximations to
the value function of a Markov decision process (MDP). Here the value of a state $s$, denoted
$v_\pi(s)$, is defined as the sum of the expected long-term discounted rewards that will be
received if the process starts in $s$ and subsequently takes actions as specified by the
decision-making policy $\pi$, called the target policy. If there are a small number of states, then
it may be practical to approximate the function $v_\pi$ by a table, but more generally a parametric
form is used, such as a polynomial, multi-layer neural network, or linear mapping. Also
key is the source of the data, in particular, the policy used to interact with the MDP. If
the data is obtained while following the target policy $\pi$, then good convergence results are
available for linear function approximation. This case is called on-policy learning because
learning occurs while “on” the policy being learned about. In the alternative, off-policy
case, one seeks to learn about $v_\pi$ while behaving (selecting actions) according to a different
policy called the behavior policy, which we denote by $\mu$. Baird (1995) showed definitively
that parametric TD learning was much less robust in the off-policy case by exhibiting
counterexamples for which both linear TD(0) and linear Q-learning had unstable expected
updates and, as a result, the parameters of their linear function approximation diverged to
infinity. This is a serious limitation, as the off-policy aspect is key to Q-learning (perhaps
the single most popular reinforcement learning algorithm), to learning from historical data
and from demonstrations, and to the idea of using TD learning for perception and world
knowledge.





An Emphatic Approach to Off-policy TD Learning 


Over the years, several different approaches have been taken to solving the problem of
off-policy learning. Baird (1995) proposed an approach based on gradient descent in the
Bellman error for general parametric function approximation that has the desired
computational properties, but which requires access to the MDP for double sampling and which
in practice often learns slowly. Gordon (1995, 1996) proposed restricting attention to
function approximators that are averagers, but this does not seem to be possible without storing
many of the training examples, which would defeat the primary strength that we seek to
obtain from parametric function approximation. The LSTD($\lambda$) method was always relatively
robust to off-policy training (e.g., Lagoudakis & Parr 2003, Yu 2010, Mahmood, van Hasselt
& Sutton 2014), but its per-step computational complexity is quadratic in the number of
parameters of the function approximator, as opposed to the linear complexity of TD($\lambda$)
and the other methods. Perhaps the most successful approach to date is the gradient-TD
approach (e.g., Maei 2011, Sutton et al. 2009, Maei et al. 2010), including hybrid methods
such as HTD (Hackman 2012). Gradient-TD methods are of linear complexity and
guaranteed to converge for appropriately chosen step-size parameters but are more complex than
TD($\lambda$) because they require a second auxiliary set of parameters with a second step size
that must be set in a problem-dependent way for good performance. The studies by White
(in preparation), Geist and Scherrer (2014), and Dann, Neumann, and Peters (2014) are
the most extensive empirical explorations of gradient-TD and related methods to date.

In this paper we explore a new approach to solving the problem of off-policy TD learning
with function approximation. The approach has novel elements but is similar to that
developed by Precup, Sutton, and Dasgupta in 2001. They proposed to use importance sampling
to reweight the updates of linear TD($\lambda$), emphasizing or de-emphasizing states as they were
encountered, and thereby create a weighting equivalent to the stationary distribution under
the target policy, from which the results of Tsitsiklis and Van Roy (1997) would apply and
guarantee convergence. As we discuss later, this approach has very high variance and was
eventually abandoned in favor of the gradient-TD approach. The new approach we explore
in this paper is similar in that it also varies emphasis so as to reweight the distribution of
linear TD($\lambda$) updates, but to a different goal. The new goal is to create a weighting
equivalent to the followon distribution for the target policy started in the stationary distribution
of the behavior policy. The followon distribution weights states according to how often they
would occur prior to termination by discounting if the target policy was followed.

Our main result is to prove that varying emphasis according to the followon distribution
produces a new version of linear TD($\lambda$), called emphatic TD($\lambda$), that is stable under general
off-policy training. By “stable” we mean that the expected update over the ergodic
distribution (Tsitsiklis & Van Roy 1997) is a contraction, involving a positive definite matrix. We
concentrate on stability in this paper because it is a prerequisite for full convergence of the
stochastic algorithm. Demonstrations that the linear TD($\lambda$) is not stable under off-policy
training have been the focus of previous counterexamples (Baird 1995, Tsitsiklis & Van Roy
1996, 1997, see Sutton & Barto 1998). Substantial additional theoretical machinery would
be required for a full convergence proof. Recent work by Yu (in preparation) builds on our
stability result to prove that the emphatic TD($\lambda$) converges with probability one.

In this paper we first treat the simplest algorithm for which the difficulties of off-policy
temporal-difference (TD) learning arise: the TD(0) algorithm with linear function
approximation. We examine the conditions under which the expected update of on-policy TD(0)
is stable, then why those conditions do not apply under off-policy training, and finally how
they can be recovered for off-policy training using established importance-sampling methods
together with the emphasis idea. After introducing the basic idea of emphatic algorithms
using the special case of TD(0), we then develop the general case. In particular, we consider
a case with general state-dependent discounting and bootstrapping functions, and with a
user-specified allocation of function approximation resources. Our new theoretical results
and the emphatic TD($\lambda$) algorithm are presented fully for this general case. Empirical
examples elucidating the main theoretical results are presented in the last section prior to the
conclusion.

2. On-policy Stability of TD(0)

To begin, let us review the conditions for stability of conventional TD($\lambda$) under on-policy
training with data from a continuing finite Markov decision process. Consider the simplest
function approximation case, that of linear TD($\lambda$) with $\lambda = 0$ and constant discount-rate
parameter $\gamma \in [0,1)$. Conventional linear TD(0) is defined by the following update to the
parameter vector $\theta_t \in \mathbb{R}^n$, made at each of a sequence of time steps $t = 0, 1, 2, \ldots$, on
transition from state $S_t \in \mathcal{S}$ to state $S_{t+1} \in \mathcal{S}$, taking action $A_t \in \mathcal{A}$ and receiving reward
$R_{t+1} \in \mathbb{R}$:

$$\theta_{t+1} \doteq \theta_t + \alpha \left( R_{t+1} + \gamma \theta_t^\top \phi(S_{t+1}) - \theta_t^\top \phi(S_t) \right) \phi(S_t), \tag{1}$$

where $\alpha > 0$ is a step-size parameter, and $\phi(s) \in \mathbb{R}^n$ is the feature vector corresponding to
state $s$. The notation “$\doteq$” indicates an equality by definition rather than one that follows
from previous definitions. In on-policy training, the actions are chosen according to a target
policy $\pi : \mathcal{A} \times \mathcal{S} \to [0,1]$, where $\pi(a|s) \doteq \mathbb{P}\{A_t = a \mid S_t = s\}$. The state and action sets
$\mathcal{S}$ and $\mathcal{A}$ are assumed to be finite, but the number of states is assumed much larger than the
number of learned parameters, $|\mathcal{S}| \doteq N \gg n$, so that function approximation is necessary. We use
linear function approximation, in which the inner product of the parameter vector and the
feature vector for a state is meant to be an approximation to the value of that state:

$$\theta_t^\top \phi(s) \approx v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s], \tag{2}$$

where $\mathbb{E}_\pi[\cdot]$ denotes an expectation conditional on all actions being selected according to $\pi$,
and $G_t$, the return at time $t$, is defined by

$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots. \tag{3}$$
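As a minimal sketch of update (1), the following Python function performs one linear TD(0) step. The function name `td0_update`, the use of plain lists for vectors, and the numbers in the usage lines are illustrative assumptions, not part of the original presentation.

```python
def td0_update(theta, phi_s, phi_next, reward, alpha, gamma):
    """One linear TD(0) step following update (1); vectors are plain lists."""
    v_s = sum(w * x for w, x in zip(theta, phi_s))        # theta^T phi(S_t)
    v_next = sum(w * x for w, x in zip(theta, phi_next))  # theta^T phi(S_{t+1})
    td_error = reward + gamma * v_next - v_s
    # Move theta along the feature vector of the state being left.
    return [w + alpha * td_error * x for w, x in zip(theta, phi_s)]


# Hypothetical two-feature example: one transition with reward 1.
theta = td0_update([0.0, 0.0], phi_s=[1.0, 0.0], phi_next=[0.0, 1.0],
                   reward=1.0, alpha=0.1, gamma=0.9)
print(theta)
```

Only the component of the parameter vector that is active in the departing state's features changes, which is the sense in which the update adjusts the earlier prediction.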

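The TD(0) update (1) can be rewritten to make the stability issues more transparent:

$$\begin{aligned}
\theta_{t+1} &= \theta_t + \alpha \Big( \underbrace{R_{t+1} \phi(S_t)}_{b_t \in \mathbb{R}^n} - \underbrace{\phi(S_t) \big( \phi(S_t) - \gamma \phi(S_{t+1}) \big)^\top}_{A_t \in \mathbb{R}^{n \times n}} \theta_t \Big) \\
&= \theta_t + \alpha (b_t - A_t \theta_t) \qquad\qquad (4) \\
&= (I - \alpha A_t) \theta_t + \alpha b_t.
\end{aligned}$$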

The matrix $A_t$ multiplies the parameter $\theta_t$ and is thereby critical to the stability of the
iteration. To develop intuition, consider the special case in which $A_t$ is a diagonal matrix.
If any of the diagonal elements are negative, then the corresponding diagonal element of
$I - \alpha A_t$ will be greater than one, and the corresponding component of $\theta_t$ will be amplified,
which will lead to divergence if continued. (The second term ($\alpha b_t$) does not affect the
stability of the iteration.) On the other hand, if the diagonal elements of $A_t$ are all positive,
then $\alpha$ can be chosen smaller than one over the largest of them, such that $I - \alpha A_t$ is diagonal
with all diagonal elements between 0 and 1. In this case the first term of the update tends
to shrink $\theta_t$, and stability is assured. In general, $\theta_t$ will be reduced toward zero whenever
$A_t$ is positive definite (footnote 1).

In actuality, however, $A_t$ and $b_t$ are random variables that vary from step to step, in
which case stability is determined by the steady-state expectation, $\lim_{t\to\infty} \mathbb{E}[A_t]$. In our
setting, after an initial transient, states will be visited according to the steady-state distribution
under $\pi$. We represent this distribution by a vector $d_\pi$, each component of which gives the
limiting probability of being in a particular state (footnote 2), $[d_\pi]_s \doteq d_\pi(s) \doteq \lim_{t\to\infty} \mathbb{P}\{S_t = s\}$, which
we assume exists and is positive at all states (any states not visited with nonzero probability
can be removed from the problem). The special property of the steady-state distribution is
that once the process is in it, it remains in it. Let $P_\pi$ denote the $N \times N$ matrix of transition
probabilities $[P_\pi]_{ij} \doteq \sum_a \pi(a|i)\, p(j|i,a)$, where $p(j|i,a) \doteq \mathbb{P}\{S_{t+1} = j \mid S_t = i, A_t = a\}$. Then
the special property of $d_\pi$ is that

$$P_\pi^\top d_\pi = d_\pi. \tag{5}$$

Consider any stochastic algorithm of the form (4), and let $A \doteq \lim_{t\to\infty} \mathbb{E}[A_t]$ and
$b \doteq \lim_{t\to\infty} \mathbb{E}[b_t]$. We define the stochastic algorithm to be stable if and only if the
corresponding deterministic algorithm,

$$\bar\theta_{t+1} \doteq \bar\theta_t + \alpha (b - A \bar\theta_t), \tag{6}$$

is convergent to a unique fixed point independent of the initial $\bar\theta_0$. This will occur iff the
$A$ matrix has a full set of eigenvalues all of whose real parts are positive. If a stochastic
algorithm is stable and $\alpha$ is reduced according to an appropriate schedule, then its parameter
vector may converge with probability one. However, in this paper we focus only on stability
as a prerequisite for convergence, leaving convergence itself to future work. If the stochastic
algorithm converges, it is to a fixed point $\bar\theta$ of the deterministic algorithm, at which $A \bar\theta = b$,
or $\bar\theta = A^{-1} b$. (Stability assures existence of the inverse.) In this paper we focus on
establishing stability by proving that $A$ is positive definite. From definiteness it immediately
follows that $A$ has a full set of eigenvectors (because $y^\top A y > 0, \forall y \neq 0$) and that the
corresponding eigenvalues all have positive real parts (footnote 3).


1. A real matrix $A$ is defined to be positive definite in this paper iff $y^\top A y > 0$ for any real vector $y \neq 0$.

2. Here and throughout the paper we use brackets with subscripts to denote the individual elements of
vectors and matrices.

3. To see the latter, let $\mathrm{Re}(x)$ denote the real part of a complex number $x$, and let $y^*$ denote the conjugate
transpose of a complex vector $y$. Then, for any eigenvalue-eigenvector pair $\lambda, y$: $0 < \mathrm{Re}(y^* A y) =
\mathrm{Re}(y^* \lambda y) = \mathrm{Re}(\lambda)\, y^* y \implies 0 < \mathrm{Re}(\lambda)$.
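To make the stability definition concrete, the deterministic iteration (6) can be run numerically. The sketch below is only an illustration under assumed values: the helper name `iterate`, the $2 \times 2$ matrix, the vector $b$, and the negative scalar used for the unstable case are all hypothetical choices, not taken from the analysis above.

```python
def iterate(A, b, theta, alpha, steps):
    """Run the deterministic update (6): theta <- theta + alpha * (b - A theta)."""
    n = len(theta)
    for _ in range(steps):
        A_theta = [sum(A[i][j] * theta[j] for j in range(n)) for i in range(n)]
        theta = [theta[i] + alpha * (b[i] - A_theta[i]) for i in range(n)]
    return theta


# Illustrative positive definite A (y^T A y > 0 for all y != 0); fixed point A^{-1} b = (1, 1).
A = [[2.0, 1.0], [0.0, 1.0]]
b = [3.0, 1.0]
stable = iterate(A, b, [10.0, -10.0], alpha=0.1, steps=500)

# Illustrative scalar case with a negative "A": the iterate is amplified on every step.
unstable = iterate([[-0.2]], [0.0], [10.0], alpha=0.1, steps=50)

print(stable, unstable)
```

In the first case the iterate approaches the fixed point regardless of the starting vector; in the second it moves steadily away from zero, and a smaller step size would slow this growth without reversing it.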
Now let us return to analyzing on-policy TD(0). Its $A$ matrix is

$$\begin{aligned}
A = \lim_{t\to\infty} \mathbb{E}[A_t] &= \lim_{t\to\infty} \mathbb{E}_\pi\!\left[ \phi(S_t) \big( \phi(S_t) - \gamma \phi(S_{t+1}) \big)^\top \right] \\
&= \sum_s d_\pi(s)\, \phi(s) \Big( \phi(s) - \gamma \sum_{s'} [P_\pi]_{ss'} \phi(s') \Big)^\top \\
&= \Phi^\top D_\pi (I - \gamma P_\pi) \Phi,
\end{aligned}$$

where $\Phi$ is the $N \times n$ matrix with the $\phi(s)$ as its rows, and $D_\pi$ is the $N \times N$ diagonal
matrix with $d_\pi$ on its diagonal. This $A$ matrix is typical of those we consider in this paper
in that it consists of $\Phi^\top$ and $\Phi$ wrapped around a distinctive $N \times N$ matrix that varies
with the algorithm and the setting, and which we call the key matrix. An $A$ matrix of this
form will be positive definite whenever the corresponding key matrix is positive definite (footnote 4).
In this case the key matrix is $D_\pi (I - \gamma P_\pi)$.

4. Strictly speaking, positive definiteness of the key matrix assures only that $A$ is positive semi-definite,
because it is possible that $\Phi y = 0$ for some $y \neq 0$, in which case $y^\top A y$ will be zero as well. To rule
this out, we assume, as is commonly done, that the columns of $\Phi$ are linearly independent (i.e., that the
features are not redundant), and thus that $\Phi y = 0$ only if $y = 0$. If this were not true, then convergence
(if it occurs) may not be to a unique $\theta_\infty$, but rather to a subspace of parameter vectors all of which
produce the same approximate value function.

For a key matrix of this type, positive definiteness is assured if all of its columns sum
to a nonnegative number. This was shown by Sutton (1988, p. 27) based on two previously
established theorems. One theorem says that any matrix $M$ is positive definite if and
only if the symmetric matrix $S = M + M^\top$ is positive definite (Sutton 1988, appendix).
The second theorem says that any symmetric real matrix $S$ is positive definite if all of its
diagonal entries are positive and greater than the sum of the absolute values of the
corresponding off-diagonal entries (Varga 1962, p. 23). For our key matrix, $D_\pi (I - \gamma P_\pi)$, the
diagonal entries are positive and the off-diagonal entries are negative, so all we have to show is
that each row sum plus the corresponding column sum is positive. The row sums are all positive
because $P_\pi$ is a stochastic matrix and $\gamma < 1$. Thus it only remains to show that the column sums
are nonnegative. Note that the row vector of the column sums of any matrix $M$ can be
written as $\mathbf{1}^\top M$, where $\mathbf{1}$ is the column vector with all components equal to 1. The column
sums of our key matrix, then, are:

$$\begin{aligned}
\mathbf{1}^\top D_\pi (I - \gamma P_\pi) &= d_\pi^\top (I - \gamma P_\pi) \\
&= d_\pi^\top - \gamma d_\pi^\top P_\pi \\
&= d_\pi^\top - \gamma d_\pi^\top \qquad \text{(by (5))} \\
&= (1 - \gamma)\, d_\pi^\top,
\end{aligned}$$

all components of which are positive. Thus, the key matrix and its $A$ matrix are positive
definite, and on-policy TD(0) is stable. Additional conditions and a schedule for reducing
$\alpha$ over time (as in Tsitsiklis and Van Roy 1997) are needed to prove convergence with
probability one to $\theta_\infty = A^{-1} b$, but the analysis above includes the most important steps
that vary from algorithm to algorithm.
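The column-sum identity above is easy to check numerically. The following sketch does so for a small chain; the two-state transition matrix, the helper names `stationary` and `key_matrix_column_sums`, and the use of repeated multiplication by $P_\pi^\top$ to approximate $d_\pi$ are assumptions made purely for illustration.

```python
def stationary(P, iters=1000):
    """Approximate d with P^T d = d by repeatedly applying P^T to a uniform start."""
    n = len(P)
    d = [1.0 / n] * n
    for _ in range(iters):
        d = [sum(d[i] * P[i][j] for i in range(n)) for j in range(n)]
    return d


def key_matrix_column_sums(weights, P, gamma):
    """Column sums of diag(weights) (I - gamma P), i.e. 1^T D (I - gamma P)."""
    n = len(P)
    return [sum(weights[i] * ((1.0 if i == j else 0.0) - gamma * P[i][j])
                for i in range(n)) for j in range(n)]


P_pi = [[0.9, 0.1], [0.5, 0.5]]   # hypothetical on-policy transition matrix
gamma = 0.9
d_pi = stationary(P_pi)
sums = key_matrix_column_sums(d_pi, P_pi, gamma)
expected = [(1 - gamma) * d for d in d_pi]   # (1 - gamma) d_pi, as derived above
print(sums, expected)
```

Because the weighting on the diagonal is the stationary distribution of the same matrix $P_\pi$ that appears in the key matrix, the two printed vectors agree up to rounding and every column sum is positive.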
3. Instability of Off-policy TD(0)

Before developing the off-policy setting in detail, it is useful to understand informally why
TD(0) is susceptible to instability. TD learning involves learning an estimate from an
estimate, which can be problematic if there is generalization between the two estimates. For
example, suppose there is a transition between two states with the same feature
representation except that the second is twice as big:

$$\theta \;\longrightarrow\; 2\theta$$

where here $\theta$ and $2\theta$ are the estimated values of the two states; that is, their feature
representations are a single feature that is 1 for the first state and 2 for the second (Tsitsiklis
& Van Roy 1996, 1997). Now suppose that $\theta$ is 10 and the reward on the transition is 0.
The transition is then from a state valued at 10 to a state valued at 20. If $\gamma$ is near 1 and $\alpha$
is 0.1, then $\theta$ will be increased to approximately 11. But then the next time the transition
occurs there will be an even bigger increase in value, from 11 to 22, and a bigger increase in
$\theta$, to approximately 12.1. If this transition is experienced repeatedly on its own, then the
system is unstable and the parameter increases without bound; that is, it diverges. We call this
the $\theta \to 2\theta$ problem.
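The arithmetic of this example can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function name `repeat_transition` is hypothetical, and $\gamma = 0.99$ is chosen here as one concrete value "near 1".

```python
def repeat_transition(theta, alpha, gamma, steps):
    """Apply the TD(0) update for the theta -> 2*theta transition repeatedly (reward 0)."""
    history = [theta]
    for _ in range(steps):
        # Features are 1 in the first state and 2 in the second.
        td_error = 0.0 + gamma * (2.0 * theta) - 1.0 * theta
        theta = theta + alpha * td_error * 1.0   # scaled by the first state's feature
        history.append(theta)
    return history


history = repeat_transition(10.0, alpha=0.1, gamma=0.99, steps=2)
print(history)   # roughly 10, 11, 12.1

long_run = repeat_transition(10.0, alpha=0.1, gamma=0.99, steps=50)
print(long_run[-1])   # keeps growing when only this transition is experienced
```

Each update multiplies the parameter by the same factor greater than one, so the growth is geometric.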

In on-policy learning, repeatedly experiencing just this single problematic transition
cannot happen, because, after the highly-valued $2\theta$ state has been entered, it must then be
exited. The transition from it will either be to a lesser or equally-valued state, in which
case $\theta$ will be significantly decreased, or to an even higher-valued state which in turn must
be followed by an even larger decrease in its estimated value or a still higher-valued state.
Eventually, the promise of high value must be made good in the form of a high reward, or
else estimates will be decreased, and this ultimately constrains $\theta$ and forces stability and
convergence. In the off-policy case, however, if there is a deviation from the target policy
then the promise is excused and need never be fulfilled. Later in this section we present a
complete example of how the $\theta \to 2\theta$ problem can cause instability and divergence under
off-policy training.

With these intuitions, we now detail our off-policy setting. As in the on-policy case, the
data is a single, infinite-length trajectory of actions, rewards, and feature vectors generated
by a continuing finite Markov decision process. The difference is that the actions are
selected not according to the target policy $\pi$, but according to a different behavior policy
$\mu : \mathcal{A} \times \mathcal{S} \to [0,1]$, yet still we seek to estimate state values under $\pi$ (as in (2)). Of course,
it would be impossible to estimate the values under $\pi$ if the actions that it would take
were never taken by $\mu$ and their consequences were never observed. Thus we assume that
$\mu(a|s) > 0$ for every state and action for which $\pi(a|s) > 0$. This is called the assumption
of coverage. It is trivially satisfied by any $\epsilon$-greedy or soft behavior policy. As before we
assume that there is a stationary distribution $d_\mu(s) \doteq \lim_{t\to\infty} \mathbb{P}\{S_t = s\} > 0, \forall s \in \mathcal{S}$, with
corresponding $N$-vector $d_\mu$.

Even if there is coverage, the behavior policy will choose actions with proportions
different from the target policy. For example, some actions taken by $\mu$ might never be chosen
by $\pi$. To address this, we use importance sampling to correct for the relative probability of
taking the action actually taken, $A_t$, in the state actually encountered, $S_t$, under the target
and behavior policies:

$$\rho_t \doteq \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)}.$$

This quantity is called the importance sampling ratio at time $t$. Note that its expected
value is one:

$$\mathbb{E}_\mu[\rho_t \mid S_t = s] = \sum_a \mu(a|s) \frac{\pi(a|s)}{\mu(a|s)} = \sum_a \pi(a|s) = 1.$$

The ratio will be exactly one only on time steps on which the action probabilities for the two 
policies are exactly the same; these time steps can be treated the same as in the on-policy 
case. On other time steps the ratio will be greater or less than one depending on whether 
the action taken was more or less likely under the target policy than under the behavior 
policy, and some kind of correction is needed. 

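In general, for any random variable $Z_{t+1}$ dependent on $S_t$, $A_t$ and $S_{t+1}$, we can recover
its expectation under the target policy by multiplying by the importance sampling ratio:

$$\begin{aligned}
\mathbb{E}_\mu[\rho_t Z_{t+1} \mid S_t = s] &= \sum_a \mu(a|s) \frac{\pi(a|s)}{\mu(a|s)} Z_{t+1} \\
&= \sum_a \pi(a|s) Z_{t+1} \\
&= \mathbb{E}_\pi[Z_{t+1} \mid S_t = s], \quad \forall s \in \mathcal{S}. \qquad (7)
\end{aligned}$$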
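Both properties of the importance sampling ratio, its unit expectation and identity (7), can be checked exactly by enumerating actions in a single state. The policies, the outcome values `Z`, and the helper names below are hypothetical and chosen only for illustration.

```python
def importance_ratio(pi, mu, action):
    """rho = pi(a|s) / mu(a|s) for one fixed state; requires mu(a|s) > 0 (coverage)."""
    return pi[action] / mu[action]


def expectation_under(policy, values):
    """Sum over actions of policy(a|s) * values[a]."""
    return sum(policy[a] * values[a] for a in policy)


# Hypothetical action probabilities in a single state s.
mu = {"left": 0.5, "right": 0.5}   # behavior policy
pi = {"left": 0.0, "right": 1.0}   # target policy
Z = {"left": 5.0, "right": 3.0}    # illustrative outcome following each action

rho = {a: importance_ratio(pi, mu, a) for a in mu}
mean_rho = expectation_under(mu, rho)                             # E_mu[rho | s]
weighted = expectation_under(mu, {a: rho[a] * Z[a] for a in mu})  # E_mu[rho Z | s]
target = expectation_under(pi, Z)                                 # E_pi[Z | s]
print(mean_rho, weighted, target)
```

The action that the target policy never takes receives a ratio of zero, so its outcome drops out of the weighted average, while the remaining action is up-weighted to compensate.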

We can use this fact to begin to adapt TD(0) for off-policy learning (Precup, Sutton &
Singh 2000). We simply multiply the whole TD(0) update (1) by $\rho_t$:

$$\begin{aligned}
\theta_{t+1} &\doteq \theta_t + \rho_t \alpha \left( R_{t+1} + \gamma \theta_t^\top \phi_{t+1} - \theta_t^\top \phi_t \right) \phi_t \qquad\qquad (8) \\
&= \theta_t + \alpha \Big( \underbrace{\rho_t R_{t+1} \phi_t}_{b_t} - \underbrace{\rho_t \phi_t (\phi_t - \gamma \phi_{t+1})^\top}_{A_t} \theta_t \Big),
\end{aligned}$$

where here we have used the shorthand $\phi_t \doteq \phi(S_t)$. Note that if the action taken at time $t$
is never taken under the target policy in that state, then $\rho_t = 0$ and there is no update on
that step, as desired. We call this algorithm off-policy TD(0).

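Off-policy TD(0)’s $A$ matrix is

$$\begin{aligned}
A = \lim_{t\to\infty} \mathbb{E}[A_t] &= \lim_{t\to\infty} \mathbb{E}_\mu\!\left[ \rho_t \phi_t (\phi_t - \gamma \phi_{t+1})^\top \right] \\
&= \sum_s d_\mu(s)\, \mathbb{E}_\mu\!\left[ \rho_k \phi_k (\phi_k - \gamma \phi_{k+1})^\top \mid S_k = s \right] \\
&= \sum_s d_\mu(s)\, \mathbb{E}_\pi\!\left[ \phi_k (\phi_k - \gamma \phi_{k+1})^\top \mid S_k = s \right] \qquad \text{(by (7))} \\
&= \sum_s d_\mu(s)\, \phi(s) \Big( \phi(s) - \gamma \sum_{s'} [P_\pi]_{ss'} \phi(s') \Big)^\top \\
&= \Phi^\top D_\mu (I - \gamma P_\pi) \Phi,
\end{aligned}$$

where $D_\mu$ is the $N \times N$ diagonal matrix with the stationary distribution $d_\mu$ on its diagonal.
Thus, the key matrix that must be positive definite is $D_\mu (I - \gamma P_\pi)$ and, unlike in the
on-policy case, the distribution and the transition probabilities do not match. We do not have
an analog of (5): in general $P_\pi^\top d_\mu \neq d_\mu$, and in fact the column sums may be negative and the
matrix not positive definite, in which case divergence of the parameter is likely.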

A simple $\theta \to 2\theta$ example of divergence that fits the setting in this section is shown in
Figure 1. From each state there are two actions, left and right, which take the process to the
left or right states. All the rewards are zero. As before, there is a single parameter $\theta$ and
the single feature is 1 and 2 in the two states such that the approximate values are $\theta$ and
$2\theta$ as shown. The behavior policy is to go left and right with equal probability from both
states, such that equal time is spent on average in both states, $d_\mu = (0.5, 0.5)^\top$. The target
policy is to go right in both states. We seek to learn the value from each state given that
the right action is continually taken. The transition probability matrix for this example is:

$$P_\pi = \begin{bmatrix} 0 & 1 \\ 0 & 1 \end{bmatrix}.$$

The key matrix is

$$D_\mu (I - \gamma P_\pi) = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix} \times \begin{bmatrix} 1 & -0.9 \\ 0 & 0.1 \end{bmatrix} = \begin{bmatrix} 0.5 & -0.45 \\ 0 & 0.05 \end{bmatrix}. \tag{9}$$


We can see an immediate indication that the key matrix may not be positive definite in
that the second column sums to a negative number. More definitively, one can show that
it is not positive definite by multiplying it on both sides by $y = \Phi = (1, 2)^\top$:

$$\Phi^\top D_\mu (I - \gamma P_\pi) \Phi = \begin{bmatrix} 1 & 2 \end{bmatrix} \times \begin{bmatrix} 0.5 & -0.45 \\ 0 & 0.05 \end{bmatrix} \times \begin{bmatrix} 1 \\ 2 \end{bmatrix} = \begin{bmatrix} 1 & 2 \end{bmatrix} \times \begin{bmatrix} -0.4 \\ 0.1 \end{bmatrix} = -0.2.$$

That this is negative means that the key matrix is not positive definite. We have also
calculated here the $A$ matrix; it is this negative scalar, $A = -0.2$. Clearly, this expected
update and algorithm are not stable.
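The matrix calculations for this example can be reproduced directly. The sketch below uses plain nested lists and a small hypothetical `matmul` helper; the variable names are illustrative, while the numerical inputs are those of the example ($\gamma = 0.9$, $d_\mu = (0.5, 0.5)^\top$, features 1 and 2).

```python
def matmul(X, Y):
    """Product of two matrices given as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


gamma = 0.9
D_mu = [[0.5, 0.0], [0.0, 0.5]]   # behavior stationary distribution on the diagonal
P_pi = [[0.0, 1.0], [0.0, 1.0]]   # target policy always goes right
Phi = [[1.0], [2.0]]              # single feature: 1 in the first state, 2 in the second

I_minus_gP = [[(1.0 if i == j else 0.0) - gamma * P_pi[i][j] for j in range(2)]
              for i in range(2)]
key = matmul(D_mu, I_minus_gP)                           # key matrix of (9)
column_sums = [key[0][j] + key[1][j] for j in range(2)]  # second entry is negative
Phi_T = [[Phi[0][0], Phi[1][0]]]
A = matmul(matmul(Phi_T, key), Phi)[0][0]                # the scalar A matrix
print(key, column_sums, A)
```

Up to floating-point rounding, the script reproduces the key matrix of (9), a negative second column sum, and the negative scalar $A$.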

[Figure 1: $\theta \to 2\theta$ example without a terminal state, with $\lambda = 0$, $\gamma = 0.9$,
$\mu(\text{right}|\cdot) = 0.5$, and $\pi(\text{right}|\cdot) = 1$.]

It is also easy to see the instability of this example more directly, without matrices.
We know that only transitions under the right action cause updates, as $\rho_t$ will be zero for
the others. Assume for concreteness that initially $\theta_t = 10$ and that $\alpha = 0.1$. On a right
transition from the first state the update will be

$$\begin{aligned}
\theta_{t+1} &= \theta_t + \rho_t \alpha \left( R_{t+1} + \gamma \theta_t^\top \phi_{t+1} - \theta_t^\top \phi_t \right) \phi_t \\
&= 10 + 2 \cdot 0.1\, (0 + 0.9 \cdot 10 \cdot 2 - 10 \cdot 1)\, 1 \\
&= 10 + 1.6,
\end{aligned}$$

whereas, on a right transition from the second state the update will be

$$\begin{aligned}
\theta_{t+1} &= \theta_t + \rho_t \alpha \left( R_{t+1} + \gamma \theta_t^\top \phi_{t+1} - \theta_t^\top \phi_t \right) \phi_t \\
&= 10 + 2 \cdot 0.1\, (0 + 0.9 \cdot 10 \cdot 2 - 10 \cdot 2)\, 2 \\
&= 10 - 0.8.
\end{aligned}$$

These two transitions occur equally often, so the net change will be positive. That is, $\theta$ will
increase, moving farther from its correct value, zero. Everything is linear in $\theta$, so the next
time around, with a larger starting $\theta$, the increase in $\theta$ will be larger still, and divergence
occurs. A smaller value of $\alpha$ would not prevent divergence, only reduce its rate.
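The two updates and their net effect can also be checked in code. This is a minimal sketch for the scalar-feature case of update (8); the function names `off_policy_td0_step` and `expected_step`, and the choice to iterate the expected (rather than sampled) update for 100 steps, are illustrative assumptions.

```python
def off_policy_td0_step(theta, rho, phi, phi_next, reward, alpha, gamma):
    """Scalar-feature version of the off-policy TD(0) update (8)."""
    td_error = reward + gamma * theta * phi_next - theta * phi
    return theta + rho * alpha * td_error * phi


alpha, gamma = 0.1, 0.9
rho_right = 1.0 / 0.5   # pi(right|.) / mu(right|.)

# Changes in theta on a right transition from each state, starting at theta = 10.
from_first = off_policy_td0_step(10.0, rho_right, 1.0, 2.0, 0.0, alpha, gamma) - 10.0
from_second = off_policy_td0_step(10.0, rho_right, 2.0, 2.0, 0.0, alpha, gamma) - 10.0


def expected_step(theta):
    """Expected new theta under d_mu = (0.5, 0.5) and mu(right|.) = 0.5.

    Left actions have rho = 0 and therefore contribute no change.
    """
    change = 0.0
    for phi in (1.0, 2.0):   # feature of the current state; the next state has feature 2
        new = off_policy_td0_step(theta, rho_right, phi, 2.0, 0.0, alpha, gamma)
        change += 0.5 * 0.5 * (new - theta)   # P(state) * P(right | state)
    return theta + change


theta = 10.0
for _ in range(100):
    theta = expected_step(theta)
print(from_first, from_second, theta)
```

The expected change per step from $\theta = 10$ is $0.25\,(1.6 - 0.8) = 0.2$, which matches $-\alpha A \theta$ with $A = -0.2$, so the expected update grows the parameter by a fixed factor on every step.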

4. Off-policy Stability of Emphatic TD(0) 

The deep reason for the difficulty of off-policy learning is that the behavior policy may
take the process to a distribution of states different from that which would be encountered
under the target policy, yet the states might appear to be the same or similar because of
function approximation. Earlier work by Precup, Sutton and Dasgupta (2001) attempted
to completely correct for the different state distribution using importance sampling ratios to
reweight the states encountered. It is theoretically possible to convert the state weighting
from $d_\mu$ to $d_\pi$ using the product of all importance sampling ratios from time 0, but in
practice this approach has extremely high variance and is infeasible for the continuing
(non-episodic) case. It works in theory because after converting the weighting the key matrix is
$D_\pi (I - \gamma P_\pi)$ again, which we know to be positive definite.

Most subsequent works abandoned the idea of completely correcting for the state
distribution. For example, the work on gradient-TD methods (e.g., Sutton et al. 2009, Maei
2011) seeks to minimize the mean-squared projected Bellman error weighted by $d_\mu$. We call
this an excursion setting because we can think of the contemplated switch to the target
policy as an excursion from the steady-state distribution of the behavior policy, $d_\mu$. The
excursions would start from $d_\mu$ and then follow $\pi$ until termination, followed by a
resumption of $\mu$ and thus a gradual return to $d_\mu$. Of course these excursions never actually occur
during off-policy learning; they are just contemplated, and thus the state distribution in
fact never leaves $d_\mu$. It is the excursion view that we take in this paper, but still we use
techniques similar to those introduced by Precup et al. (2001) to determine an emphasis
weighting that corrects for the state distribution, only toward a different goal (footnote 5).

5. Kolter (2011) also suggested adapting the distribution of states at which updates are made to improve
convergence and solution quality, but did not provide a linear-complexity algorithm.

The excursion notion suggests a different weighting of TD(0) updates. We consider that
at every time step we are beginning a new contemplated excursion from the current state.
The excursion thus would begin in a state sampled from $d_\mu$. If an excursion started it would
pass through a sequence of subsequent states and actions prior to termination. Some of the
actions that are actually taken (under $\mu$) are relatively likely to occur under the target policy
as compared to the behavior policy, while others are relatively unlikely; the corresponding
states can be appropriately reweighted based on importance sampling ratios. Thus, there
will still be a product of importance sampling ratios, but only since the beginning of the
excursion, and the variance will also be tamped down by the discounting; the variance will
be much less than in the earlier approach. In the simplest case of an off-policy emphatic
algorithm, the update at time $t$ is emphasized or de-emphasized proportional to a new scalar
variable $F_t$, defined by $F_0 \doteq 1$ and

$$F_t \doteq \gamma \rho_{t-1} F_{t-1} + 1, \quad \forall t > 0, \tag{10}$$
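which we call the followon trace. Specifically, we define emphatic TD(0) by the following
update:

$$\begin{aligned}
\theta_{t+1} &\doteq \theta_t + \alpha F_t \rho_t \left( R_{t+1} + \gamma \theta_t^\top \phi_{t+1} - \theta_t^\top \phi_t \right) \phi_t \qquad\qquad (11) \\
&= \theta_t + \alpha \Big( \underbrace{F_t \rho_t R_{t+1} \phi_t}_{b_t} - \underbrace{F_t \rho_t \phi_t (\phi_t - \gamma \phi_{t+1})^\top}_{A_t} \theta_t \Big).
\end{aligned}$$

Emphatic TD(0)’s $A$ matrix is

$$\begin{aligned}
A = \lim_{t\to\infty} \mathbb{E}[A_t] &= \lim_{t\to\infty} \mathbb{E}_\mu\!\left[ F_t \rho_t \phi_t (\phi_t - \gamma \phi_{t+1})^\top \right] \\
&= \sum_s d_\mu(s) \lim_{t\to\infty} \mathbb{E}_\mu\!\left[ F_t \rho_t \phi_t (\phi_t - \gamma \phi_{t+1})^\top \mid S_t = s \right] \\
&= \sum_s d_\mu(s) \lim_{t\to\infty} \mathbb{E}_\mu[F_t \mid S_t = s]\; \mathbb{E}_\mu\!\left[ \rho_t \phi_t (\phi_t - \gamma \phi_{t+1})^\top \mid S_t = s \right] \\
&\qquad \text{(because, given $S_t$, $F_t$ is independent of $\rho_t \phi_t (\phi_t - \gamma \phi_{t+1})^\top$)} \\
&= \sum_s \underbrace{d_\mu(s) \lim_{t\to\infty} \mathbb{E}_\mu[F_t \mid S_t = s]}_{f(s)}\; \mathbb{E}_\mu\!\left[ \rho_k \phi_k (\phi_k - \gamma \phi_{k+1})^\top \mid S_k = s \right] \\
&= \sum_s f(s)\, \mathbb{E}_\pi\!\left[ \phi_k (\phi_k - \gamma \phi_{k+1})^\top \mid S_k = s \right] \qquad \text{(by (7))} \\
&= \sum_s f(s)\, \phi(s) \Big( \phi(s) - \gamma \sum_{s'} [P_\pi]_{ss'} \phi(s') \Big)^\top \\
&= \Phi^\top F (I - \gamma P_\pi) \Phi,
\end{aligned}$$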

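where $F$ is a diagonal matrix with diagonal elements $f(s) \doteq d_\mu(s) \lim_{t\to\infty} \mathbb{E}_\mu[F_t \mid S_t = s]$,
which we assume exists. As we show later, the vector $f \in \mathbb{R}^N$ with components $[f]_s \doteq f(s)$
can be written as

$$\begin{aligned}
f &= d_\mu + \gamma P_\pi^\top d_\mu + \left( \gamma P_\pi^\top \right)^2 d_\mu + \cdots \qquad\qquad (12) \\
&= \left( I - \gamma P_\pi^\top \right)^{-1} d_\mu. \qquad\qquad (13)
\end{aligned}$$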

The key matrix is $\mathbf{F}(\mathbf{I} - \gamma \mathbf{P}_\pi)$, and the vector of its column sums is

$$
\begin{aligned}
\mathbf{1}^\top \mathbf{F} (\mathbf{I} - \gamma \mathbf{P}_\pi) &= \mathbf{f}^\top (\mathbf{I} - \gamma \mathbf{P}_\pi) \\
&= \mathbf{d}_\mu^\top (\mathbf{I} - \gamma \mathbf{P}_\pi)^{-1} (\mathbf{I} - \gamma \mathbf{P}_\pi) && \text{(using (13))} \\
&= \mathbf{d}_\mu^\top,
\end{aligned}
$$

all components of which are positive. Thus, the key matrix and the $\mathbf{A}$ matrix are positive
definite and the algorithm is stable. Emphatic TD(0) is the simplest TD algorithm with
linear function approximation proven to be stable under off-policy training.

The $\theta \to 2\theta$ example presented earlier (Figure 1) provides some insight into how replacing
$\mathbf{D}_\mu$ by $\mathbf{F}$ changes the key matrix to make it positive definite. In general, $\mathbf{f}$ is the expected
number of time steps that would be spent in each state during an excursion starting from
the behavior distribution $\mathbf{d}_\mu$. From (12), it is $\mathbf{d}_\mu$ plus where you would get to in one step
from $\mathbf{d}_\mu$, plus where you would get to in two steps, etc., with appropriate discounting. In
the example, excursions under the target policy take you to the second state ($2\theta$) and leave
you there. Thus you are only in the first state ($\theta$) if you start there, and only for one step,
so $f(1) = d_\mu(1) = 0.5$. For the second state, you can either start there, with probability 0.5,
or you can get there on the second step (certain except for discounting), with probability
0.9, or on the third step, with probability $0.9^2$, etc., so $f(2) = 0.5 + 0.9 + 0.9^2 + 0.9^3 + \cdots =
0.5 + 0.9 \cdot 10 = 9.5$. Thus, the key matrix is now

$$
\mathbf{F}(\mathbf{I} - \gamma \mathbf{P}_\pi) =
\begin{bmatrix} 0.5 & 0 \\ 0 & 9.5 \end{bmatrix}
\times
\begin{bmatrix} 1 & -0.9 \\ 0 & 0.1 \end{bmatrix}
=
\begin{bmatrix} 0.5 & -0.45 \\ 0 & 0.95 \end{bmatrix}.
$$

Note that because $\mathbf{F}$ is a diagonal matrix, its only effect is to scale the rows. Here it
emphasizes the lower row by more than a factor of 10 compared to the upper row, thereby
causing the key matrix to have positive column sums and be positive definite (cf. (9)). The
$\mathbf{F}$ matrix emphasizes the second state, which would occur much more often under the target
policy than it does under the behavior policy.
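The arithmetic in this example can be reproduced with a short script. The sketch below is only an illustration: it assumes the two-state problem has features 1 and 2 (so the approximate values are $\theta$ and $2\theta$), computes $\mathbf{f}$ by iterating the fixed-point form of (12) rather than inverting a matrix, and compares the scalar $\mathbf{A}$ obtained with weighting $\mathbf{F}$ against the one obtained with weighting $\mathbf{D}_\mu$. All function and variable names are hypothetical.

```python
from typing import List

GAMMA = 0.9
P_PI = [[0.0, 1.0], [0.0, 1.0]]  # target policy always moves to the second state
D_MU = [0.5, 0.5]                # behavior distribution used in the example
PHI = [1.0, 2.0]                 # assumed features: values theta and 2*theta


def followon_vector(sweeps: int = 2000) -> List[float]:
    """Iterate f <- d_mu + gamma * P_pi^T f, whose limit is equation (13)."""
    f = list(D_MU)
    for _ in range(sweeps):
        f = [
            D_MU[s] + GAMMA * sum(P_PI[r][s] * f[r] for r in range(2))
            for s in range(2)
        ]
    return f


def key_matrix(weights: List[float]) -> List[List[float]]:
    """Return diag(weights) (I - gamma P_pi) for the two-state example."""
    return [
        [weights[s] * ((1.0 if s == r else 0.0) - GAMMA * P_PI[s][r]) for r in range(2)]
        for s in range(2)
    ]


def a_scalar(key: List[List[float]]) -> float:
    """Return Phi^T K Phi, a scalar because there is a single parameter."""
    return sum(PHI[s] * key[s][r] * PHI[r] for s in range(2) for r in range(2))


f = followon_vector()
a_emphatic = a_scalar(key_matrix(f))         # weighting by F
a_conventional = a_scalar(key_matrix(D_MU))  # weighting by D_mu
```

Under these assumptions the emphatic weighting gives a positive $\mathbf{A}$ while the conventional weighting gives a negative one, which is the sign change the text describes.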


5. The General Case 

We turn now to a very general case of off-policy learning with linear function approximation.
The objective is still to evaluate a policy $\pi$ from a single trajectory under a different policy
$\mu$, but now the value of a state is defined not with respect to a constant discount rate
$\gamma \in [0,1]$, but with respect to a discount rate that varies from state to state according
to a discount function $\gamma : \mathcal{S} \to [0,1]$ such that $\prod_{k=1}^{\infty} \gamma(S_{t+k}) = 0$, w.p.1, $\forall t$. That is, our
approximation is still defined by (2), but now (3) is replaced by

$$G_t = R_{t+1} + \gamma(S_{t+1}) R_{t+2} + \gamma(S_{t+1})\gamma(S_{t+2}) R_{t+3} + \cdots. \tag{14}$$

State-dependent discounting specifies a temporal envelope within which received rewards are
accumulated. If $\gamma(S_k) = 0$, then the time of accumulation is fully terminated at step $k > t$,
and if $\gamma(S_k) < 1$, then it is partially terminated. We call both of these soft termination
because they are like the termination of an episode, but the actual trajectory is not affected.
Soft termination ends the accumulation of rewards into a return, but the state transitions
continue oblivious to the termination. Soft termination with state-dependent termination
is essential for learning models of options (Sutton et al. 1999) and other applications.

Soft termination is particularly natural in the excursion setting, where it makes it easy 
to define excursions of finite and definite duration. For example, consider the deterministic 
MDP shown in Figure 2. There are five states, three of which do not discount at all,
$\gamma(s) = 1$, and are shown as circles, and two of which cause complete soft termination,
$\gamma(s) = 0$, and are shown as squares. The terminating states do not end anything other
than the return; actions are still selected in them and, dependent on the action selected,
they transition to next states indefinitely without end. In this MDP there are two actions,
left and right, which deterministically cause transitions to the left or right except at the
edges, where there may be a self transition. The reward on all transitions is +1. The
behavior policy is to select left 2/3rds of the time in all states, which causes more time to
be spent in states on the left than on the right. The stationary distribution can be shown
to be $\approx (0.52, 0.26, 0.13, 0.06, 0.03)^\top$; more than half of the time steps are spent in the
leftmost terminating state.

[Figure 2 graphic: five states in a row with true values $v_\pi = 4, 3, 2, 1, 1$ from left to right; reward 1 on each transition; $\mu(\text{left}|\cdot) = 2/3$; $\pi(\text{right}|\cdot) = 1$.]

Figure 2: A 5-state chain MDP with soft-termination states at each end.
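The quoted stationary distribution can be checked by power iteration. The sketch below is a hedged illustration: it assumes that an action pointing off either end of the chain produces a self transition (the text says only that there "may be" one), and the function names are hypothetical.

```python
from typing import List


def behavior_transition_matrix(n_states: int = 5, p_left: float = 2.0 / 3.0) -> List[List[float]]:
    """Transition matrix of the chain under the behavior policy.

    Assumption of this sketch: moving off either edge is a self transition.
    """
    P = [[0.0] * n_states for _ in range(n_states)]
    for s in range(n_states):
        left = max(s - 1, 0)
        right = min(s + 1, n_states - 1)
        P[s][left] += p_left
        P[s][right] += 1.0 - p_left
    return P


def stationary_distribution(P: List[List[float]], sweeps: int = 500) -> List[float]:
    """Power iteration d <- d P, starting from the uniform distribution."""
    n = len(P)
    d = [1.0 / n] * n
    for _ in range(sweeps):
        d = [sum(d[r] * P[r][s] for r in range(n)) for s in range(n)]
    return d


d_mu = stationary_distribution(behavior_transition_matrix())
```

Under the self-transition assumption the result is proportional to (16, 8, 4, 2, 1), which rounds to the values quoted in the text.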

Consider the target policy $\pi$ that selects the right action from all states. The correct
value $v_\pi(s)$ of each state $s$ is written above it in the figure. For both of the two rightmost
states, the right action results in a reward of 1 and an immediate termination, so their values
are both 1. For the middle state, following $\pi$ (selecting right repeatedly) yields two rewards
of 1 prior to termination. There is no discounting ($\gamma = 1$) prior to termination, so the middle
state's value is 2, and similarly the values go up by 1 for each state to its left, as shown.
These are the correct values. The approximate values depend on the parameter vector $\theta_t$
as suggested by the expressions shown inside each state in the figure. These expressions
use the notation $\theta_i$ to denote the $i$th component of the current parameter vector $\theta_t$. In
this example, there are five states and only three parameters, so it is unlikely, and indeed
impossible, to represent $v_\pi$ exactly. We will return to this example later in the paper.
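The true values quoted above follow from the return definition (14) applied to the always-right policy. A minimal sketch, assuming a self transition at the right edge and using a hypothetical function name:

```python
from typing import List


def target_policy_values(gammas: List[float], reward: float = 1.0) -> List[float]:
    """Values under 'always right': v(s) = reward + gamma(s') * v(s').

    s' is the state to the right of s; the rightmost state is assumed to
    transition to itself, which requires gamma < 1 there.
    """
    n = len(gammas)
    v = [0.0] * n
    # Rightmost state: v = reward + gamma * v, solved in closed form.
    v[n - 1] = reward / (1.0 - gammas[n - 1])
    # Sweep leftward, discounting by the discount of the state entered.
    for s in range(n - 2, -1, -1):
        v[s] = reward + gammas[s + 1] * v[s + 1]
    return v


# Soft termination (gamma = 0) at both ends, no discounting in between.
values = target_policy_values([0.0, 1.0, 1.0, 1.0, 0.0])
```

Note that the discount applied to a transition is that of the state being entered, which is why both of the two rightmost states have value 1.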

In addition to enabling definitive termination, as in this example, state-dependent
discounting enables a much wider range of predictive questions to be expressed in the form of
a value function (Sutton et al. 2011, Modayil, White & Sutton 2014, Sutton, Rafols & Koop
2006), including option models (Sutton, Precup & Singh 1999, Sutton 1995). For example,
with state-dependent discounting one can formulate questions both about what will happen
during a way of behaving and what will be true at its end. A general representation for
predictions is a key step toward the goal of representing world knowledge in verifiable
predictive terms (Sutton 2009, 2012). The general form is also useful just because it enables
us to treat uniformly many of the most important episodic and continuing special cases of
interest.

A second generalization, developed for the first time in this paper, is to explicitly specify
the states at which we are most interested in obtaining accurate estimates of value. Recall
that in parametric function approximation there are typically many more states than
parameters ($N \gg n$), and thus it is usually not possible for the value estimates at all states
to be exactly correct. Valuing some states more accurately usually means valuing others
less accurately, at least asymptotically. In the tabular case where much of the theory of
reinforcement learning originated, this tradeoff is not an issue because the estimates of each
state are independent of each other, but with function approximation it is necessary to
specify relative interest in order to make the problem well defined. Nevertheless, in the function
approximation case little attention has been paid in the literature to specifying the relative
importance of different states (an exception is Thomas 2014), though there are intimations
of this in the initiation set of options (Sutton et al. 1999). In the past it was typically
assumed that we were interested in valuing states in direct proportion to how often they
occur, but this is not always the case. For example, in episodic problems we often care
primarily about the value of the first state, or of earlier states generally (Thomas 2014).
Here we allow the user to specify the relative interest in each state with a nonnegative
interest function $i : \mathcal{S} \to [0, \infty)$. Formally, our objective is to minimize the Mean Square
Value Error (MSVE) with states weighted both by how often they occur and by our interest
in them:

$$\text{MSVE}(\theta) = \sum_{s \in \mathcal{S}} d_\mu(s)\, i(s) \left(v_\pi(s) - \theta^\top \phi(s)\right)^2. \tag{15}$$

For example, in the 5-state example in Figure 2, we could choose $i(s) = 1, \forall s \in \mathcal{S}$, in
which case we would be primarily interested in attaining low error in the states on the left
side, which are visited much more often under the behavior policy. If we want to counter
this, we might choose $i(s)$ larger for states toward the right. Of course, with parametric
function approximation we presumably do not have access to the states as individuals, but
certainly we could set $i(s)$ as a function of the features in $s$. In this example, choosing
$i(s) = 1 + \phi_2(s) + 2\phi_3(s)$ (where $\phi_i(s)$ denotes the $i$th component of $\phi(s)$) would shift the
focus on accuracy to the states on the right, making it substantially more balanced.
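The objective (15) and the effect of the interest function can be sketched in a few lines. The feature vectors below are an assumption made purely for illustration (the expressions inside the states of Figure 2 are not reproduced in this text), and `d_mu` is taken proportional to (16, 8, 4, 2, 1), which rounds to the stationary distribution quoted earlier. The function name `msve` is hypothetical.

```python
from typing import List, Sequence


def msve(
    theta: Sequence[float],
    features: Sequence[Sequence[float]],
    true_values: Sequence[float],
    d_mu: Sequence[float],
    interest: Sequence[float],
) -> float:
    """Mean Square Value Error of equation (15) for a linear value estimate."""
    total = 0.0
    for phi_s, v_s, d_s, i_s in zip(features, true_values, d_mu, interest):
        estimate = sum(t * x for t, x in zip(theta, phi_s))
        total += d_s * i_s * (v_s - estimate) ** 2
    return total


d_mu = [16 / 31, 8 / 31, 4 / 31, 2 / 31, 1 / 31]
true_values = [4.0, 3.0, 2.0, 1.0, 1.0]
# Hypothetical three-component features, chosen only for this sketch.
features = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]]

uniform_interest = [1.0] * 5
# Interest as a function of the features: i(s) = 1 + phi_2(s) + 2 * phi_3(s).
right_heavy_interest = [1.0 + phi[1] + 2.0 * phi[2] for phi in features]
# Combined weighting d_mu(s) * i(s) that appears in (15).
weights = [d * i for d, i in zip(d_mu, right_heavy_interest)]
```

With these assumed features the combined weights are proportional to (16, 16, 8, 8, 3), noticeably flatter than the behavior distribution alone.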

The third and final generalization that we introduce in this section is general
bootstrapping. Conventional TD($\lambda$) uses a bootstrapping parameter $\lambda \in [0,1]$; we generalize this
to a bootstrapping function $\lambda : \mathcal{S} \to [0,1]$ specifying a potentially different degree of
bootstrapping, $1 - \lambda(s)$, for each state $s$. General bootstrapping of this form has been partially
developed in several previous works (Sutton 1995, Sutton & Barto 1998, Maei & Sutton
2010, Sutton et al. 2014, cf. Yu 2012). As a notational shorthand, let us use $\lambda_t = \lambda(S_t)$ and
$\gamma_t = \gamma(S_t)$. Then we can define a general notion of bootstrapped return, the $\lambda$-return with
state-dependent bootstrapping and discounting:

$$G_t^\lambda = R_{t+1} + \gamma_{t+1} \left[ (1 - \lambda_{t+1})\, \theta_t^\top \phi_{t+1} + \lambda_{t+1} G_{t+1}^\lambda \right]. \tag{16}$$

The $\lambda$-return plays a key role in the theoretical understanding of TD methods, in particular,
in their forward views (Sutton & Barto 1998, Sutton, Mahmood, Precup & van Hasselt
2014). In the forward view, $G_t^\lambda$ is thought of as the target for the update at time $t$, even
though it is not available until many steps later (when complete termination $\gamma(S_k) = 0$ has
occurred for the first time for some $k > t$).

Given these generalizations, we can now specify our final new algorithm, emphatic
TD($\lambda$), by the following four equations, for $t \ge 0$:

$$\theta_{t+1} = \theta_t + \alpha \left(R_{t+1} + \gamma_{t+1} \theta_t^\top \phi_{t+1} - \theta_t^\top \phi_t\right) \mathbf{e}_t \tag{17}$$

$$\mathbf{e}_t = \rho_t \left(\gamma_t \lambda_t \mathbf{e}_{t-1} + M_t \phi_t\right), \quad \text{with } \mathbf{e}_{-1} = \mathbf{0} \tag{18}$$

$$M_t = \lambda_t\, i(S_t) + (1 - \lambda_t) F_t \tag{19}$$

$$F_t = \rho_{t-1} \gamma_t F_{t-1} + i(S_t), \quad \text{with } F_0 = i(S_0), \tag{20}$$

where $F_t \ge 0$ is a scalar memory called the followon trace. The quantity $M_t \ge 0$ is termed
the emphasis on step $t$.
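Equations (17)-(20) translate almost line for line into code. The following is a minimal sketch of one time step in plain Python; the `EmphaticState` container, the function name, and the convention of starting with `followon = 0` and `rho_prev = 0` (so that the first step yields $F_0 = i(S_0)$) are assumptions of this sketch rather than anything prescribed by the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EmphaticState:
    """Quantities carried from one time step to the next (hypothetical container)."""
    theta: List[float]      # parameter vector theta_t
    trace: List[float]      # eligibility trace e_{t-1}; all zeros initially
    followon: float = 0.0   # F_{t-1}; zero before the first step
    rho_prev: float = 0.0   # rho_{t-1}; zero before the first step


def emphatic_td_lambda_step(
    state: EmphaticState,
    phi: List[float],
    phi_next: List[float],
    reward: float,
    rho: float,
    gamma: float,
    gamma_next: float,
    lam: float,
    interest: float,
    alpha: float,
) -> EmphaticState:
    """One step of emphatic TD(lambda), following equations (17)-(20)."""
    # (20): followon trace F_t.
    followon = state.rho_prev * gamma * state.followon + interest
    # (19): emphasis M_t.
    emphasis = lam * interest + (1.0 - lam) * followon
    # (18): eligibility trace e_t.
    trace = [rho * (gamma * lam * e + emphasis * x) for e, x in zip(state.trace, phi)]
    # (17): TD error computed with theta_t, then the parameter update.
    v = sum(t * x for t, x in zip(state.theta, phi))
    v_next = sum(t * x for t, x in zip(state.theta, phi_next))
    delta = reward + gamma_next * v_next - v
    theta = [t + alpha * delta * e for t, e in zip(state.theta, trace)]
    return EmphaticState(theta=theta, trace=trace, followon=followon, rho_prev=rho)
```

Only one parameter vector and one step size appear, and the per-step cost is linear in the number of features, in line with the complexity claims made for the method.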


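6. Off-policy Stability of Emphatic TD($\lambda$)

As usual, to analyze the stability of the new algorithm we examine its $\mathbf{A}$ matrix. The
stochastic update can be written:

$$
\begin{aligned}
\theta_{t+1} &= \theta_t + \alpha \left(R_{t+1} + \gamma_{t+1} \theta_t^\top \phi_{t+1} - \theta_t^\top \phi_t\right) \mathbf{e}_t \\
&= \theta_t + \alpha \Big( \underbrace{\mathbf{e}_t R_{t+1}}_{\mathbf{b}_t} - \underbrace{\mathbf{e}_t (\phi_t - \gamma_{t+1} \phi_{t+1})^\top}_{\mathbf{A}_t}\, \theta_t \Big).
\end{aligned}
$$

Thus,

$$
\begin{aligned}
\mathbf{A} &= \lim_{t\to\infty} \mathbb{E}[\mathbf{A}_t] = \lim_{t\to\infty} \mathbb{E}_\mu\!\left[\mathbf{e}_t (\phi_t - \gamma_{t+1}\phi_{t+1})^\top\right] \\
&= \sum_s d_\mu(s) \lim_{t\to\infty} \mathbb{E}_\mu\!\left[\mathbf{e}_t (\phi_t - \gamma_{t+1}\phi_{t+1})^\top \,\middle|\, S_t = s\right] \\
&= \sum_s d_\mu(s) \lim_{t\to\infty} \mathbb{E}_\mu\!\left[\rho_t (\gamma_t \lambda_t \mathbf{e}_{t-1} + M_t \phi_t)(\phi_t - \gamma_{t+1}\phi_{t+1})^\top \,\middle|\, S_t = s\right] \\
&= \sum_s d_\mu(s) \lim_{t\to\infty} \mathbb{E}_\mu\!\left[\gamma_t \lambda_t \mathbf{e}_{t-1} + M_t \phi_t \,\middle|\, S_t = s\right] \mathbb{E}_\mu\!\left[\rho_t (\phi_t - \gamma_{t+1}\phi_{t+1})^\top \,\middle|\, S_t = s\right] \\
&\qquad \text{(because, given $S_t$, $\mathbf{e}_{t-1}$ and $M_t$ are independent of $\rho_t(\phi_t - \gamma_{t+1}\phi_{t+1})^\top$)} \\
&= \sum_s \underbrace{d_\mu(s) \lim_{t\to\infty} \mathbb{E}_\mu\!\left[\gamma_t \lambda_t \mathbf{e}_{t-1} + M_t \phi_t \,\middle|\, S_t = s\right]}_{\mathbf{e}(s) \in \mathbb{R}^n} \mathbb{E}_\mu\!\left[\rho_k (\phi_k - \gamma_{k+1}\phi_{k+1})^\top \,\middle|\, S_k = s\right] \\
&= \sum_s \mathbf{e}(s)\, \mathbb{E}_\pi\!\left[\phi_k - \gamma_{k+1}\phi_{k+1} \,\middle|\, S_k = s\right]^\top \\
&\qquad \text{(by the importance-sampling identity $\mathbb{E}_\mu[\rho_k\,\cdot \mid S_k = s] = \mathbb{E}_\pi[\,\cdot \mid S_k = s]$)} \\
&= \sum_s \mathbf{e}(s) \Big(\phi(s) - \sum_{s'} [\mathbf{P}_\pi]_{ss'}\, \gamma(s')\, \phi(s')\Big)^{\!\top} \\
&= \mathbf{E}^\top (\mathbf{I} - \mathbf{P}_\pi \boldsymbol{\Gamma}) \boldsymbol{\Phi},
\end{aligned}
\tag{21}
$$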

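where $\mathbf{E}$ is an $N \times n$ matrix, $\mathbf{E}^\top = [\mathbf{e}(1), \cdots, \mathbf{e}(N)]$, and $\mathbf{e}(s) \in \mathbb{R}^n$ is defined by (see footnote 6)

$$
\begin{aligned}
\mathbf{e}(s) &= d_\mu(s) \lim_{t\to\infty} \mathbb{E}_\mu\!\left[\gamma_t \lambda_t \mathbf{e}_{t-1} + M_t \phi_t \,\middle|\, S_t = s\right] \qquad \text{(assuming this exists)} \\
&= \underbrace{d_\mu(s) \lim_{t\to\infty} \mathbb{E}_\mu[M_t \mid S_t = s]}_{m(s)}\, \phi(s) + \gamma(s)\lambda(s)\, d_\mu(s) \lim_{t\to\infty} \mathbb{E}_\mu[\mathbf{e}_{t-1} \mid S_t = s] \\
&= m(s)\phi(s) + \gamma(s)\lambda(s)\, d_\mu(s) \lim_{t\to\infty} \sum_{\bar{s},\bar{a}} \mathbb{P}\{S_{t-1} = \bar{s}, A_{t-1} = \bar{a} \mid S_t = s\}\, \mathbb{E}_\mu[\mathbf{e}_{t-1} \mid S_{t-1} = \bar{s}, A_{t-1} = \bar{a}] \\
&= m(s)\phi(s) + \gamma(s)\lambda(s)\, d_\mu(s) \sum_{\bar{s},\bar{a}} \frac{d_\mu(\bar{s})\, \mu(\bar{a}|\bar{s})\, p(s|\bar{s},\bar{a})}{d_\mu(s)} \lim_{t\to\infty} \mathbb{E}_\mu[\mathbf{e}_{t-1} \mid S_{t-1} = \bar{s}, A_{t-1} = \bar{a}] \\
&\qquad \text{(using the definition of a conditional probability, a.k.a. Bayes rule)} \\
&= m(s)\phi(s) + \gamma(s)\lambda(s) \sum_{\bar{s},\bar{a}} d_\mu(\bar{s})\, \mu(\bar{a}|\bar{s})\, p(s|\bar{s},\bar{a})\, \frac{\pi(\bar{a}|\bar{s})}{\mu(\bar{a}|\bar{s})} \lim_{t\to\infty} \mathbb{E}_\mu\!\left[\gamma_{t-1}\lambda_{t-1}\mathbf{e}_{t-2} + M_{t-1}\phi_{t-1} \,\middle|\, S_{t-1} = \bar{s}\right] \\
&= m(s)\phi(s) + \gamma(s)\lambda(s) \sum_{\bar{s}} \Big( \sum_{\bar{a}} \pi(\bar{a}|\bar{s})\, p(s|\bar{s},\bar{a}) \Big) \mathbf{e}(\bar{s}) \\
&= m(s)\phi(s) + \gamma(s)\lambda(s) \sum_{\bar{s}} [\mathbf{P}_\pi]_{\bar{s}s}\, \mathbf{e}(\bar{s}).
\end{aligned}
$$

6. Note that this is a slight abuse of notation; $\mathbf{e}_t$ is a vector random variable, one per time step, and $\mathbf{e}(s)$
is a vector expectation, one per state.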


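We now introduce three $N \times N$ diagonal matrices: $\mathbf{M}$, which has the $m(s) = d_\mu(s) \lim_{t\to\infty} \mathbb{E}_\mu[M_t \mid S_t = s]$
on its diagonal; $\boldsymbol{\Gamma}$, which has the $\gamma(s)$ on its diagonal; and $\boldsymbol{\Lambda}$, which has the
$\lambda(s)$ on its diagonal. With these we can write the equation above entirely in matrix form,
as

$$
\begin{aligned}
\mathbf{E}^\top &= \boldsymbol{\Phi}^\top \mathbf{M} + \mathbf{E}^\top \mathbf{P}_\pi \boldsymbol{\Gamma} \boldsymbol{\Lambda} \\
&= \boldsymbol{\Phi}^\top \mathbf{M} + \boldsymbol{\Phi}^\top \mathbf{M} \mathbf{P}_\pi \boldsymbol{\Gamma} \boldsymbol{\Lambda} + \boldsymbol{\Phi}^\top \mathbf{M} (\mathbf{P}_\pi \boldsymbol{\Gamma} \boldsymbol{\Lambda})^2 + \cdots \\
&= \boldsymbol{\Phi}^\top \mathbf{M} (\mathbf{I} - \mathbf{P}_\pi \boldsymbol{\Gamma} \boldsymbol{\Lambda})^{-1}.
\end{aligned}
$$

Finally, combining this equation with (21) we obtain

$$\mathbf{A} = \boldsymbol{\Phi}^\top \mathbf{M} (\mathbf{I} - \mathbf{P}_\pi \boldsymbol{\Gamma} \boldsymbol{\Lambda})^{-1} (\mathbf{I} - \mathbf{P}_\pi \boldsymbol{\Gamma}) \boldsymbol{\Phi}, \tag{22}$$

and through similar steps one can also obtain emphatic TD($\lambda$)'s $\mathbf{b}$ vector,

$$\mathbf{b} = \mathbf{E}^\top \mathbf{r}_\pi = \boldsymbol{\Phi}^\top \mathbf{M} (\mathbf{I} - \mathbf{P}_\pi \boldsymbol{\Gamma} \boldsymbol{\Lambda})^{-1} \mathbf{r}_\pi, \tag{23}$$

where $\mathbf{r}_\pi$ is the $N$-vector of expected immediate rewards from each state under $\pi$.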

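Emphatic TD($\lambda$)'s key matrix, then, is $\mathbf{M}(\mathbf{I} - \mathbf{P}_\pi \boldsymbol{\Gamma}\boldsymbol{\Lambda})^{-1}(\mathbf{I} - \mathbf{P}_\pi \boldsymbol{\Gamma})$. To prove that it is
positive definite we will follow the same strategy as we did for emphatic TD(0). The first
step will be to write the last part of the key matrix in the form of the identity matrix minus
a probability matrix. To see how this can be done, consider a slightly different setting in
which actions are taken according to $\pi$, and in which $1 - \gamma(s)$ and $1 - \lambda(s)$ are considered
probabilities of ending by terminating or by bootstrapping, respectively. That is, for any
starting state, a trajectory involves a state transition according to $\mathbf{P}_\pi$, possibly terminating
according to $\mathbf{I} - \boldsymbol{\Gamma}$, then possibly ending with a bootstrapping event according to $\mathbf{I} - \boldsymbol{\Lambda}$, and
then, if neither of these occur, continuing with another state transition and more chances
to end, and so on until an ending of one of the two kinds occurs. For any start state $i \in \mathcal{S}$,
consider the probability that the trajectory ends in state $j \in \mathcal{S}$ with an ending event of the
bootstrapping kind (according to $\mathbf{I} - \boldsymbol{\Lambda}$). Let $\mathbf{P}_\pi^\lambda$ be the matrix with this probability as its
$ij$th component. This matrix can be written

$$
\begin{aligned}
\mathbf{P}_\pi^\lambda &= \mathbf{P}_\pi \boldsymbol{\Gamma} (\mathbf{I} - \boldsymbol{\Lambda}) + \mathbf{P}_\pi \boldsymbol{\Gamma}\boldsymbol{\Lambda} \mathbf{P}_\pi \boldsymbol{\Gamma} (\mathbf{I} - \boldsymbol{\Lambda}) + \mathbf{P}_\pi \boldsymbol{\Gamma} (\boldsymbol{\Lambda} \mathbf{P}_\pi \boldsymbol{\Gamma})^2 (\mathbf{I} - \boldsymbol{\Lambda}) + \cdots \\
&= \Big( \sum_{k=0}^{\infty} (\mathbf{P}_\pi \boldsymbol{\Gamma}\boldsymbol{\Lambda})^k \Big) \mathbf{P}_\pi \boldsymbol{\Gamma} (\mathbf{I} - \boldsymbol{\Lambda}) \\
&= (\mathbf{I} - \mathbf{P}_\pi \boldsymbol{\Gamma}\boldsymbol{\Lambda})^{-1} \mathbf{P}_\pi \boldsymbol{\Gamma} (\mathbf{I} - \boldsymbol{\Lambda}) \\
&= (\mathbf{I} - \mathbf{P}_\pi \boldsymbol{\Gamma}\boldsymbol{\Lambda})^{-1} (\mathbf{P}_\pi \boldsymbol{\Gamma} - \mathbf{P}_\pi \boldsymbol{\Gamma}\boldsymbol{\Lambda}) \\
&= (\mathbf{I} - \mathbf{P}_\pi \boldsymbol{\Gamma}\boldsymbol{\Lambda})^{-1} (\mathbf{P}_\pi \boldsymbol{\Gamma} - \mathbf{I} + \mathbf{I} - \mathbf{P}_\pi \boldsymbol{\Gamma}\boldsymbol{\Lambda}) \\
&= \mathbf{I} - (\mathbf{I} - \mathbf{P}_\pi \boldsymbol{\Gamma}\boldsymbol{\Lambda})^{-1} (\mathbf{I} - \mathbf{P}_\pi \boldsymbol{\Gamma}),
\end{aligned}
$$

or,

$$\mathbf{I} - \mathbf{P}_\pi^\lambda = (\mathbf{I} - \mathbf{P}_\pi \boldsymbol{\Gamma}\boldsymbol{\Lambda})^{-1} (\mathbf{I} - \mathbf{P}_\pi \boldsymbol{\Gamma}). \tag{24}$$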

It follows then that $\mathbf{M}(\mathbf{I} - \mathbf{P}_\pi^\lambda) = \mathbf{M}(\mathbf{I} - \mathbf{P}_\pi \boldsymbol{\Gamma}\boldsymbol{\Lambda})^{-1}(\mathbf{I} - \mathbf{P}_\pi \boldsymbol{\Gamma})$ is another way of writing
emphatic TD($\lambda$)'s key matrix (cf. (22)). This gets us considerably closer to our goal of
proving that the key matrix is positive definite. It is now immediate that its diagonal entries
are nonnegative and that its off-diagonal entries are nonpositive. It is also immediate that
its row sums are nonnegative.

There remains what is typically the hardest condition to satisfy: that the column sums
of the key matrix are positive. To show this we have to analyze $\mathbf{M}$, and to do that we first
analyze the $N$-vector $\mathbf{f}$ with components $f(s) = d_\mu(s) \lim_{t\to\infty} \mathbb{E}_\mu[F_t \mid S_t = s]$ (we assume that
this limit and expectation exist). Analyzing $\mathbf{f}$ will also pay the debt we incurred in Section
4 when we claimed without proof that $\mathbf{f}$ (in the special case treated in that section) was as
given by (13). In the general case:

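$$
\begin{aligned}
f(s) &= d_\mu(s) \lim_{t\to\infty} \mathbb{E}_\mu[F_t \mid S_t = s] \\
&= d_\mu(s) \lim_{t\to\infty} \mathbb{E}_\mu\!\left[i(S_t) + \rho_{t-1}\gamma_t F_{t-1} \,\middle|\, S_t = s\right] && \text{(by (20))} \\
&= d_\mu(s)\, i(s) + d_\mu(s)\gamma(s) \lim_{t\to\infty} \sum_{\bar{s},\bar{a}} \mathbb{P}\{S_{t-1} = \bar{s}, A_{t-1} = \bar{a} \mid S_t = s\}\, \frac{\pi(\bar{a}|\bar{s})}{\mu(\bar{a}|\bar{s})}\, \mathbb{E}_\mu[F_{t-1} \mid S_{t-1} = \bar{s}] \\
&= d_\mu(s)\, i(s) + d_\mu(s)\gamma(s) \sum_{\bar{s},\bar{a}} \frac{d_\mu(\bar{s})\, \mu(\bar{a}|\bar{s})\, p(s|\bar{s},\bar{a})}{d_\mu(s)}\, \frac{\pi(\bar{a}|\bar{s})}{\mu(\bar{a}|\bar{s})} \lim_{t\to\infty} \mathbb{E}_\mu[F_{t-1} \mid S_{t-1} = \bar{s}] \\
&\qquad \text{(using the definition of a conditional probability, a.k.a. Bayes rule)} \\
&= d_\mu(s)\, i(s) + \gamma(s) \sum_{\bar{s},\bar{a}} \pi(\bar{a}|\bar{s})\, p(s|\bar{s},\bar{a})\, d_\mu(\bar{s}) \lim_{t\to\infty} \mathbb{E}_\mu[F_{t-1} \mid S_{t-1} = \bar{s}] \\
&= d_\mu(s)\, i(s) + \gamma(s) \sum_{\bar{s}} [\mathbf{P}_\pi]_{\bar{s}s}\, f(\bar{s}).
\end{aligned}
$$

This equation can be written in matrix-vector form, letting $\mathbf{i}$ be the $N$-vector with
components $[\mathbf{i}]_s = d_\mu(s)\, i(s)$:

$$
\begin{aligned}
\mathbf{f} &= \mathbf{i} + \boldsymbol{\Gamma} \mathbf{P}_\pi^\top \mathbf{f} \\
&= \mathbf{i} + \boldsymbol{\Gamma} \mathbf{P}_\pi^\top \mathbf{i} + (\boldsymbol{\Gamma} \mathbf{P}_\pi^\top)^2 \mathbf{i} + \cdots \\
&= (\mathbf{I} - \boldsymbol{\Gamma} \mathbf{P}_\pi^\top)^{-1} \mathbf{i}.
\end{aligned}
\tag{25}
$$

This proves (13), because there $i(s) = 1, \forall s$ (thus $\mathbf{i} = \mathbf{d}_\mu$), and $\gamma(s) = \gamma, \forall s$.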

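We are now ready to analyze $\mathbf{M}$, the diagonal matrix with the $m(s)$ on its diagonal:

$$
\begin{aligned}
m(s) &= d_\mu(s) \lim_{t\to\infty} \mathbb{E}_\mu[M_t \mid S_t = s] \\
&= d_\mu(s) \lim_{t\to\infty} \mathbb{E}_\mu\!\left[\lambda_t\, i(S_t) + (1 - \lambda_t) F_t \,\middle|\, S_t = s\right] && \text{(by (19))} \\
&= d_\mu(s)\lambda(s)\, i(s) + (1 - \lambda(s)) f(s),
\end{aligned}
$$

or, in matrix-vector form, letting $\mathbf{m}$ be the $N$-vector with components $m(s)$,

$$
\begin{aligned}
\mathbf{m} &= \boldsymbol{\Lambda} \mathbf{i} + (\mathbf{I} - \boldsymbol{\Lambda}) \mathbf{f} \\
&= \boldsymbol{\Lambda} \mathbf{i} + (\mathbf{I} - \boldsymbol{\Lambda})(\mathbf{I} - \boldsymbol{\Gamma}\mathbf{P}_\pi^\top)^{-1} \mathbf{i} && \text{(using (25))} \\
&= \left[\boldsymbol{\Lambda}(\mathbf{I} - \boldsymbol{\Gamma}\mathbf{P}_\pi^\top) + (\mathbf{I} - \boldsymbol{\Lambda})\right] (\mathbf{I} - \boldsymbol{\Gamma}\mathbf{P}_\pi^\top)^{-1} \mathbf{i} \\
&= (\mathbf{I} - \boldsymbol{\Lambda}\boldsymbol{\Gamma}\mathbf{P}_\pi^\top)(\mathbf{I} - \boldsymbol{\Gamma}\mathbf{P}_\pi^\top)^{-1} \mathbf{i} && (26) \\
&= (\mathbf{I} - {\mathbf{P}_\pi^\lambda}^{\top})^{-1} \mathbf{i}. && \text{(using (24))}
\end{aligned}
$$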


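Now we are ready for the final step of the proof, showing that all the columns of the key
matrix $\mathbf{M}(\mathbf{I} - \mathbf{P}_\pi^\lambda)$ sum to a positive number. Using the result above, the vector of column
sums is

$$
\begin{aligned}
\mathbf{1}^\top \mathbf{M}(\mathbf{I} - \mathbf{P}_\pi^\lambda) &= \mathbf{m}^\top (\mathbf{I} - \mathbf{P}_\pi^\lambda) \\
&= \mathbf{i}^\top (\mathbf{I} - \mathbf{P}_\pi^\lambda)^{-1} (\mathbf{I} - \mathbf{P}_\pi^\lambda) \\
&= \mathbf{i}^\top.
\end{aligned}
$$

If we further assume that $i(s) > 0, \forall s \in \mathcal{S}$, then the column sums are all positive, the key
matrix is positive definite, and emphatic TD($\lambda$) and its expected update are stable. This
result can be summarized in the following theorem, the main result of this paper, which we
have just proved: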

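Theorem 1 (Stability of Emphatic TD($\lambda$)) For any

• Markov decision process $\{S_t, A_t, R_{t+1}\}_{t=0}^{\infty}$ with finite state and action sets $\mathcal{S}$ and $\mathcal{A}$,

• behavior policy $\mu$ with a stationary invariant distribution $d_\mu(s) > 0, \forall s \in \mathcal{S}$,

• target policy $\pi$ with coverage, i.e., s.t., if $\pi(a|s) > 0$, then $\mu(a|s) > 0$,

• discount function $\gamma : \mathcal{S} \to [0,1]$ s.t. $\prod_{k=1}^{\infty} \gamma(S_{t+k}) = 0$, w.p.1, $\forall t > 0$,

• bootstrapping function $\lambda : \mathcal{S} \to [0,1]$,

• interest function $i : \mathcal{S} \to (0, \infty)$,

• feature function $\phi : \mathcal{S} \to \mathbb{R}^n$ s.t. the matrix $\boldsymbol{\Phi} \in \mathbb{R}^{|\mathcal{S}| \times n}$ with the $\phi(s)$ as its rows has
linearly independent columns,

the $\mathbf{A}$ matrix of linear emphatic TD($\lambda$) (as given by (17)-(20), and assuming the existence
of $\lim_{t\to\infty} \mathbb{E}[F_t \mid S_t = s]$ and $\lim_{t\to\infty} \mathbb{E}[\mathbf{e}_t \mid S_t = s]$, $\forall s \in \mathcal{S}$),

$$\mathbf{A} = \lim_{t\to\infty} \mathbb{E}_\mu[\mathbf{A}_t] = \lim_{t\to\infty} \mathbb{E}_\mu\!\left[\mathbf{e}_t (\phi_t - \gamma_{t+1}\phi_{t+1})^\top\right] = \boldsymbol{\Phi}^\top \mathbf{M}(\mathbf{I} - \mathbf{P}_\pi^\lambda)\boldsymbol{\Phi}, \tag{27}$$

is positive definite. Thus the algorithm and its expected update are stable.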
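The column-sum identity at the heart of the proof can be checked numerically on a small example. The sketch below is an illustration under stated assumptions: it reuses the 5-state chain with the always-right target policy (self transition at the right edge), takes `d_mu` proportional to (16, 8, 4, 2, 1), and picks the bootstrapping and interest values arbitrarily. To stay dependency-free it replaces the matrix inverses in (24)-(26) with fixed-point iteration; all names are hypothetical.

```python
from typing import List, Tuple


def solve_transposed(offset: List[float], scale: List[float], P: List[List[float]], sweeps: int = 200) -> List[float]:
    """Solve x = offset + diag(scale) P^T x by fixed-point iteration."""
    n = len(offset)
    x = list(offset)
    for _ in range(sweeps):
        x = [offset[s] + scale[s] * sum(P[r][s] * x[r] for r in range(n)) for s in range(n)]
    return x


def key_matrix_column_sums(
    P_pi: List[List[float]],
    d_mu: List[float],
    gammas: List[float],
    lambdas: List[float],
    interest: List[float],
) -> Tuple[List[float], List[float], List[float]]:
    """Return (i, m, column sums of M (I - P_pi^lambda))."""
    n = len(d_mu)
    i_vec = [d_mu[s] * interest[s] for s in range(n)]
    # (25): f = (I - Gamma P^T)^{-1} i.
    f = solve_transposed(i_vec, gammas, P_pi)
    # m = Lambda i + (I - Lambda) f.
    m = [lambdas[s] * i_vec[s] + (1.0 - lambdas[s]) * f[s] for s in range(n)]
    # y^T = m^T (I - P Gamma Lambda)^{-1}, i.e. y = m + Lambda Gamma P^T y.
    y = solve_transposed(m, [lambdas[s] * gammas[s] for s in range(n)], P_pi)
    # Column sums: y^T (I - P Gamma), using (24).
    sums = [y[s] - gammas[s] * sum(y[r] * P_pi[r][s] for r in range(n)) for s in range(n)]
    return i_vec, m, sums


P_pi = [
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],  # assumed self transition at the right edge
]
d_mu = [16 / 31, 8 / 31, 4 / 31, 2 / 31, 1 / 31]
gammas = [0.0, 1.0, 1.0, 1.0, 0.0]
lambdas = [0.5, 0.0, 0.9, 0.3, 1.0]   # arbitrary values for illustration
interest = [1.0, 2.0, 2.0, 4.0, 3.0]  # arbitrary positive interest

i_vec, m, column_sums = key_matrix_column_sums(P_pi, d_mu, gammas, lambdas, interest)
```

The column sums come out equal to the components of $\mathbf{i}$, and hence positive whenever the interest is positive, regardless of the choice of `lambdas`.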


As mentioned at the outset, stability is necessary but not always sufficient to guarantee
convergence of the parameter vector $\theta_t$. Yu (in preparation) has recently built on our
stability result to show that in fact emphatic TD($\lambda$) converges with probability one when
the step size $\alpha$ is reduced appropriately over time. Convergence as anticipated is to the
unique fixed point $\bar{\theta}$ of the deterministic algorithm (6), in other words, to

$$\mathbf{A}\bar{\theta} = \mathbf{b} \quad \text{or} \quad \bar{\theta} = \mathbf{A}^{-1}\mathbf{b}. \tag{28}$$

This solution can be characterized as a minimum (in fact, a zero) of the Projected Bellman
Error (PBE, Sutton et al. 2009) using the $\lambda$-dependent Bellman operator $T^{(\lambda)} : \mathbb{R}^N \to \mathbb{R}^N$
(Tsitsiklis & Van Roy 1997) and the weighting of states according to their emphasis. For our
general case, we need a version of the $T^{(\lambda)}$ operator extended to state-dependent discounting
and bootstrapping. This operator looks ahead to future states to the extent that they are
bootstrapped from, that is, according to $\mathbf{P}_\pi^\lambda$, taking into account the reward received along
the way. The appropriate operator, in vector form, is

$$T^{(\lambda)} \mathbf{v} = (\mathbf{I} - \mathbf{P}_\pi \boldsymbol{\Gamma}\boldsymbol{\Lambda})^{-1} \mathbf{r}_\pi + \mathbf{P}_\pi^\lambda \mathbf{v}. \tag{29}$$


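This operator is a contraction with fixed point $\mathbf{v} = \mathbf{v}_\pi$. Recall that our approximate value
function is $\boldsymbol{\Phi}\theta$, and thus the difference between $\boldsymbol{\Phi}\theta$ and $T^{(\lambda)}(\boldsymbol{\Phi}\theta)$ is a Bellman-error vector.
The projection of this with respect to the feature matrix and the emphasis weighting is the
emphasis-weighted PBE:

$$
\begin{aligned}
\text{PBE}(\theta) &= \boldsymbol{\Pi}\left(T^{(\lambda)}(\boldsymbol{\Phi}\theta) - \boldsymbol{\Phi}\theta\right) \\
&= \boldsymbol{\Phi}(\boldsymbol{\Phi}^\top \mathbf{M} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \mathbf{M} \left(T^{(\lambda)}(\boldsymbol{\Phi}\theta) - \boldsymbol{\Phi}\theta\right) && \text{(see Sutton et al. 2009)} \\
&= \boldsymbol{\Phi}(\boldsymbol{\Phi}^\top \mathbf{M} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \mathbf{M} \left((\mathbf{I} - \mathbf{P}_\pi \boldsymbol{\Gamma}\boldsymbol{\Lambda})^{-1} \mathbf{r}_\pi + \mathbf{P}_\pi^\lambda \boldsymbol{\Phi}\theta - \boldsymbol{\Phi}\theta\right) && \text{(by (29))} \\
&= \boldsymbol{\Phi}(\boldsymbol{\Phi}^\top \mathbf{M} \boldsymbol{\Phi})^{-1} \left(\mathbf{b} + \boldsymbol{\Phi}^\top \mathbf{M}(\mathbf{P}_\pi^\lambda - \mathbf{I})\boldsymbol{\Phi}\theta\right) && \text{(by (23))} \\
&= \boldsymbol{\Phi}(\boldsymbol{\Phi}^\top \mathbf{M} \boldsymbol{\Phi})^{-1} (\mathbf{b} - \mathbf{A}\theta). && \text{(by (27))}
\end{aligned}
$$

From (28), it is immediate that this is zero at the fixed point $\bar{\theta}$, thus $\text{PBE}(\bar{\theta}) = \mathbf{0}$.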

Finally, let us reconsider our assumption in this section that the interest function $i(s)$ is
strictly greater than zero at all states. If the interest were allowed to be zero at some states,
then the key matrix would not necessarily be positive definite, but by continuity it seems
that it would still have to be positive semi-definite (meaning $\mathbf{y}^\top \mathbf{A} \mathbf{y}$ is positive or zero for
all vectors $\mathbf{y}$). Semi-definiteness of the key matrix may well be sufficient for most purposes.
In particular, we conjecture that it is sufficient to assure convergence (under appropriate
step-size conditions) of the estimated values for all states $s$ with nonzero emphasis, $m(s)$
(cf. Wang & Bertsekas 2013). A second advantage of a convergence result based on
semi-definiteness is that it would presumably also enable removing the artificial assumption that
the columns of the feature matrix $\boldsymbol{\Phi}$ are linearly independent.

7. Derivation of the Emphasis Algorithm

Emphatic algorithms are based on the idea that if we are updating a state by a TD method,
then we should also update each state that it bootstraps from, in direct proportion. For
example, suppose we decide to update the estimate at time $t$ with unit emphasis, perhaps
because $i(S_t) = 1$, and then at time $t + 1$ we have $\gamma(S_{t+1}) = 1$ and $\lambda(S_{t+1}) = 0$. Because of
the latter, we are fully bootstrapping from the value estimate at $t + 1$ and thus we should also
make an update of it with emphasis equal to $t$'s emphasis. If instead $\lambda(S_{t+1}) = 0.5$, then
the update of the estimate at $t + 1$ would gain a half unit of emphasis, and the remaining
half would still be available to allocate to the updates of the estimate at $t + 2$ or later
times depending on their $\lambda$s. And of course there may be some emphasis allocated directly
to updating the estimate at $t + 1$ if $i(S_{t+1}) > 0$. Discounting and importance sampling also
have effects. At each step $t$, if $\gamma(S_t) < 1$, then there is some degree of termination and to
that extent there is no longer any chance of bootstrapping from later time steps. Another
way bootstrapping may be cut off is if $\rho_t = 0$ (a complete deviation from the target policy).
More generally, if $\rho_t \ne 1$, then the opportunity for bootstrapping is scaled up or down
proportionally.

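It may seem difficult to work out precisely how each time step's estimates bootstrap
from which later states' estimates for all cases. Fortunately, it has already been done.
Equation (6) of the paper by Sutton, Mahmood, Precup, and van Hasselt (2014) specifies
this in their "forward view" of off-policy TD($\lambda$) with general state-dependent discounting
and bootstrapping. From this equation (and their (5)) it is easy to determine the degree to
which the update of the value estimate at time $k$ bootstraps from (multiplicatively depends
on) the value estimates of each subsequent time $t$. It is

$$\rho_k \Big( \prod_{i=k+1}^{t-1} \gamma_i \lambda_i \rho_i \Big) \gamma_t (1 - \lambda_t).$$

It follows then that the total emphasis on time $t$, $M_t$, should be the sum of this quantity
for all times $k < t$, each times the emphasis $M_k$ for those earlier times, plus any intrinsic
interest $i(S_t)$ in time $t$:

$$
\begin{aligned}
M_t &= i(S_t) + \sum_{k=0}^{t-1} M_k \rho_k \Big( \prod_{i=k+1}^{t-1} \gamma_i \lambda_i \rho_i \Big) \gamma_t (1 - \lambda_t) \\
&= \lambda_t\, i(S_t) + (1 - \lambda_t)\, i(S_t) + (1 - \lambda_t) \gamma_t \sum_{k=0}^{t-1} \rho_k M_k \prod_{i=k+1}^{t-1} \gamma_i \lambda_i \rho_i \\
&= \lambda_t\, i(S_t) + (1 - \lambda_t) F_t,
\end{aligned}
$$

which is (19), where

$$
\begin{aligned}
F_t &= i(S_t) + \gamma_t \sum_{k=0}^{t-1} \rho_k M_k \prod_{i=k+1}^{t-1} \gamma_i \lambda_i \rho_i \\
&= i(S_t) + \gamma_t \Big( \rho_{t-1} M_{t-1} + \sum_{k=0}^{t-2} \rho_k M_k \prod_{i=k+1}^{t-1} \gamma_i \lambda_i \rho_i \Big) \\
&= i(S_t) + \gamma_t \Big( \rho_{t-1} M_{t-1} + \rho_{t-1} \lambda_{t-1} \gamma_{t-1} \sum_{k=0}^{t-2} \rho_k M_k \prod_{i=k+1}^{t-2} \gamma_i \lambda_i \rho_i \Big) \\
&= i(S_t) + \gamma_t \rho_{t-1} \Big( \underbrace{\lambda_{t-1}\, i(S_{t-1}) + (1 - \lambda_{t-1}) F_{t-1}}_{M_{t-1}} + \lambda_{t-1} \gamma_{t-1} \sum_{k=0}^{t-2} \rho_k M_k \prod_{i=k+1}^{t-2} \gamma_i \lambda_i \rho_i \Big) \\
&= i(S_t) + \gamma_t \rho_{t-1} \Big( F_{t-1} + \lambda_{t-1} \Big( -F_{t-1} + i(S_{t-1}) + \gamma_{t-1} \sum_{k=0}^{t-2} \rho_k M_k \prod_{i=k+1}^{t-2} \gamma_i \lambda_i \rho_i \Big) \Big) \\
&= i(S_t) + \gamma_t \rho_{t-1} F_{t-1},
\end{aligned}
$$

where the last step uses the first line (applied at time $t - 1$) to see that the inner parenthesized
term is zero. This is (20), completing the derivation of the emphasis algorithm.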
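The equivalence established in this derivation can also be checked numerically: the emphasis computed incrementally from (19) and (20) should match the explicit sum over earlier time steps. The sketch below does this on arbitrary randomly generated sequences; the function names and the random test data are assumptions of the sketch.

```python
import random
from typing import List


def emphasis_recursive(rho: List[float], gamma: List[float], lam: List[float], interest: List[float]) -> List[float]:
    """M_t computed incrementally with the followon trace, equations (19)-(20)."""
    followon_prev, rho_prev = 0.0, 0.0  # makes F_0 = i(S_0)
    result = []
    for t in range(len(rho)):
        followon = rho_prev * gamma[t] * followon_prev + interest[t]
        result.append(lam[t] * interest[t] + (1.0 - lam[t]) * followon)
        followon_prev, rho_prev = followon, rho[t]
    return result


def emphasis_explicit(rho: List[float], gamma: List[float], lam: List[float], interest: List[float]) -> List[float]:
    """M_t computed from the explicit sum over all earlier times k < t."""
    result: List[float] = []
    for t in range(len(rho)):
        total = interest[t]
        for k in range(t):
            product = 1.0
            for i in range(k + 1, t):
                product *= gamma[i] * lam[i] * rho[i]
            total += result[k] * rho[k] * product * gamma[t] * (1.0 - lam[t])
        result.append(total)
    return result


rng = random.Random(0)
steps = 12
rho = [rng.uniform(0.0, 2.0) for _ in range(steps)]
gamma = [rng.uniform(0.0, 1.0) for _ in range(steps)]
lam = [rng.uniform(0.0, 1.0) for _ in range(steps)]
interest = [rng.uniform(0.0, 1.0) for _ in range(steps)]

m_recursive = emphasis_recursive(rho, gamma, lam, interest)
m_explicit = emphasis_explicit(rho, gamma, lam, interest)
```

The recursive form needs constant memory and time per step, whereas the explicit sum grows with $t$, which is the practical reason for deriving (20).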


8. Empirical Examples 


In this section we present empirical results with example problems that verify and elucidate
the formal results already presented. A thorough empirical comparison of emphatic TD($\lambda$)
with other methods is beyond the scope of the present article.

The main focus in this paper, as in much previous theory of TD algorithms with function
approximation, has been on the stability of the expected update. If an algorithm is unstable,
as Q-learning and off-policy TD($\lambda$) are on Baird's (1995) counterexample, then there is no
chance of its behaving in a satisfactory manner. On the other hand, even if the update is
stable it may be of very high variance. Off-policy algorithms involve products of potentially
an infinite number of importance-sampling ratios, which can lead to fluctuations of infinite
variance.


As an example of what can happen, let's look again at the $\theta \to 2\theta$ problem shown in
Figure 1 (and shown again in the upper left of Figure 3). Consider what happens to $F_t$ in
this problem if we have interest only in the first state, and the right action happens to be
taken on every step (i.e., $i(S_0) = 1$ then $i(S_t) = 0, \forall t > 0$, and $A_t = \text{right}, \forall t \ge 0$). In this
case, from (20),

$$F_t = \rho_{t-1} \gamma_t F_{t-1} + i(S_t) = \prod_{j=0}^{t-1} \rho_j \gamma = (2 \cdot 0.9)^t,$$

which of course goes to infinity as $t \to \infty$. On the other hand, the probability of this
specific infinite action sequence is zero, and in fact $F_t$ will rarely take on very high values.
In particular, the expected value of $F_t$ remains finite at

$$
\begin{aligned}
\mathbb{E}_\mu[F_t] &= 0.5 \cdot 2 \cdot 0.9 \cdot \mathbb{E}_\mu[F_{t-1}] + 0.5 \cdot 0 \cdot 0.9 \cdot \mathbb{E}_\mu[F_{t-1}] \\
&= 0.9 \cdot \mathbb{E}_\mu[F_{t-1}] \\
&= 0.9^t,
\end{aligned}
$$

which tends to zero as $t \to \infty$. Nevertheless, this problem is indeed a difficult case, as the
variance of $F_t$ is infinite in the limit:

$$
\begin{aligned}
\text{Var}[F_t] &= \mathbb{E}[F_t^2] - (\mathbb{E}[F_t])^2 \\
&= 0.5^t (2^t\, 0.9^t)^2 - (0.9^t)^2 \\
&= (0.9^2 \cdot 2)^t - (0.9^2)^t \\
&= 1.62^t - 0.81^t,
\end{aligned}
$$

which tends to $\infty$ as $t \to \infty$.
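These two moments can be confirmed by brute-force enumeration of action sequences for small $t$. The sketch below assumes, as in the text, that the behavior policy chooses right with probability 0.5 (so $\rho = 2$ for right and $\rho = 0$ for left), that $\gamma = 0.9$, and that interest is nonzero only at $t = 0$; the function name is hypothetical.

```python
from itertools import product
from typing import Tuple


def followon_moments(t: int, gamma: float = 0.9, p_right: float = 0.5) -> Tuple[float, float]:
    """Exact mean and variance of F_t by enumerating all 2^t action sequences."""
    mean = 0.0
    second_moment = 0.0
    for actions in product((True, False), repeat=t):
        probability = 1.0
        followon = 1.0  # F_0 = i(S_0) = 1
        for took_right in actions:
            probability *= p_right if took_right else 1.0 - p_right
            # The target policy always goes right, so rho = 1/p_right or 0.
            rho = 1.0 / p_right if took_right else 0.0
            # (20) with i(S_t) = 0 for t > 0.
            followon = rho * gamma * followon
        mean += probability * followon
        second_moment += probability * followon ** 2
    return mean, second_moment - mean ** 2


mean_10, variance_10 = followon_moments(10)
```

Only the all-right sequence contributes a nonzero $F_t$, which is why the mean shrinks while the variance grows: a vanishingly rare event carries an exponentially large value.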

So what actually happens on this problem? The thin blue lines in Figure 3 (left) show
the trajectories of the single parameter $\theta$ over time in 50 runs with this problem with $\lambda = 0$
and $\alpha = 0.001$, starting at $\theta = 1.0$. We see that most trajectories of emphatic TD(0) rapidly
approached the correct value of $\theta = 0$, but a few made very large steps away from zero and
then returned. Because the variance of $F_t$ (and thus of $M_t$ and $\mathbf{e}_t$) grows to infinity as $t$
tends to infinity, there is always a small chance of an extremely large fluctuation taking $\theta$ far
away from zero. Off-policy TD(0), on the other hand, diverged to infinity in all individual
runs.

For comparison, Figure 3 (right) shows trajectories for a $\theta \to 2\theta$ problem in which $F_t$
and all the other variables and their variances are bounded. In this problem, the target
policy of selecting right on all steps leads to a soft terminal state ($\gamma(s) = 0$) with fixed
value zero, which then transitions back to start again in the leftmost state, as shown in the
upper right of the figure. (This is an example of how one can reproduce the conventional
notions of terminal state and episode in a soft termination setting.) Here we have chosen the
behavior policy to take the action left with probability 0.9, so that its stationary distribution
distinctly favors the left state, whereas the target policy would spend equal time in each
of the two states. This change increases the variance of the updates, so we used a smaller
step size, $\alpha = 0.0001$; other settings were unchanged. Conventional off-policy TD(0) still
diverged in this case, but emphatic TD(0) converged reliably to zero.

[Figure 3 graphics: two panels, each showing a $\theta \to 2\theta$ problem above a plot of parameter trajectories. Left panel: $\mu(\text{right}|\cdot) = 0.5$, $\pi(\text{right}|\cdot) = 1$. Right panel: $\mu(\text{right}|\cdot) = 0.1$, $\pi(\text{right}|\cdot) = 1$.]

Figure 3: Emphatic TD approaches the correct value of zero, whereas conventional off-
policy TD diverges, on fifty trajectories on the $\theta \to 2\theta$ problems shown above
each graph. Also shown as a thick line is the trajectory of the deterministic
expected-update algorithm (6). On the continuing problem (left) emphatic TD
had occasional high variance deviations from zero.

Finally, Figure 4 shows trajectories for the 5-state example shown earlier (and again
in the upper part of the figure). In this case, everything is bounded under the target
policy, and both algorithms converged. The emphatic algorithm achieved a lower MSVE in
this example (nevertheless, we do not mean to claim any general empirical advantage for
emphatic TD($\lambda$) at this time).

Also shown in these figures as a thick dark line is the trajectory of the deterministic
algorithm: $\theta_{t+1} = \theta_t + \alpha(\mathbf{b} - \mathbf{A}\theta_t)$ (6). Tsitsiklis and Van Roy (1997) argued that, for
small step-size parameters and in the steady-state distribution, on-policy TD($\lambda$) follows
its expected-update algorithm in an "average" sense, and we see much the same here for
emphatic TD($\lambda$).

These examples show that although emphatic TD($\lambda$) is stable for any MDP and all
functions $\lambda$, $\gamma$ and (positive) $i$, for some problems and functions the parameter vector
continues to fluctuate with a chance of arbitrarily large deviations (for constant $\alpha > 0$). It
is not clear how great a problem this is. Certainly it is much less of a problem than
the positive instability (Baird 1995) that can occur with off-policy TD($\lambda$) (stability of the
expected update precludes this). The possibility of large fluctuations may be inherent in any
algorithm for off-policy learning using importance sampling with long eligibility traces. For
example, the updates of GTD($\lambda$) and GQ($\lambda$) (Maei 2011) with $\lambda = 1$ will tend to infinite
variance as $t \to \infty$ on Baird's counterexample and on the example in Figures 1 and 3 (left).
And, as mentioned earlier, convergence with probability one can still be guaranteed if $\alpha$ is
reduced appropriately over time (Yu, in preparation).

[Figure 4 graphic: the 5-state problem above a plot of MSVE against steps, with the horizontal axis running to 50000 steps.]

Figure 4: Twenty learning curves and their analytic expectation on the 5-state problem
from Section 5, in which excursions terminate promptly and both algorithms
converge reliably. Here $\lambda = 0$, $\theta_0 = \mathbf{0}$, $\alpha = 0.001$, and $i(s) = 1, \forall s$. The MSVE
performance measure is defined in (15).

In practice, however, even when asymptotic convergence can be guaranteed, high variance 
can be problematic as it may require very small step sizes and slow learning. High 
variance frequently arises in off-policy algorithms when they are Monte Carlo algorithms 
(no TD learning) or they have eligibility traces with high λ (at λ = 1, TD algorithms become 
Monte Carlo algorithms). In both cases the root problem is the same: importance 
sampling ratios that become very large when multiplied together. For example, in the 
θ → 2θ problem discussed at the beginning of this section, the ratio was only two, but the 
products of successive twos rapidly produced a very large F_t. Thus, the first way in which 
variance can be controlled is to ensure that large products cannot occur. We are actually 
concerned with products of both ρ_t’s and γ_t’s. Occasional termination (γ_t = 0), as in the 
5-state problem, is thus one reliable way of preventing high variance. Another is through 
choice of the target and behavior policies that together determine the importance sampling 
ratios. For example, one could define the target policy to be equal to the behavior policy 
whenever the followon or eligibility traces exceed some threshold. These tricks can also 
be done prospectively. White (in preparation) proposed that the learner compute at each 
step the variance of what GTD(λ)’s traces would be on the following step. If the variance 
is in danger of becoming too large, then λ_t is reduced for that step to prevent it. For 
emphatic TD(λ), the same conditions could be used to adjust γ_t or one of the policies to 
prevent the variance from growing too large. Another idea for reducing variance is to use 
weighted importance sampling (as suggested by Precup et al. 2001) together with the ideas 
of Mahmood et al. (2014, in preparation) for extending weighted importance sampling to 
linear function approximation. Finally, a good solution may even be found by something 
as simple as bounding the values of F_t or e_t. This would limit variance at the cost of bias, 
which might be a good tradeoff if done properly. 
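
To make the growth of these products concrete, the following sketch iterates a followon-trace recursion of the form F_t = ρ_{t-1} γ_t F_{t-1} + i(S_t), which is assumed here to match the definition given earlier in the paper. It follows a run of steps on which the importance sampling ratio is two. The function name `followon_trace`, the discount of 0.9, the run length of 20, and the bound of 5 are illustrative assumptions rather than values from the text. Three cases are shown: unchecked growth, a reset caused by termination (γ_t = 0), and a simple upper bound on F_t, which trades variance for bias.

```python
def followon_trace(rhos, gammas, interests, bound=None):
    """Iterate F_t = rho_{t-1} * gamma_t * F_{t-1} + i_t and return [F_0, F_1, ...].

    rhos[t] is the importance sampling ratio at step t, gammas[t] the discount
    on arriving at step t, and interests[t] the interest at step t. If `bound`
    is given, each F_t is clipped from above (an illustrative, biased variant).
    """
    trace = []
    f_prev, rho_prev = 0.0, 1.0  # F_{-1} = 0, so that F_0 = i_0
    for rho, gamma, interest in zip(rhos, gammas, interests):
        f = rho_prev * gamma * f_prev + interest
        if bound is not None:
            f = min(f, bound)
        trace.append(f)
        f_prev, rho_prev = f, rho
    return trace


steps = 20
ones = [1.0] * steps

# Ratio of two on every step with discount 0.9: F_t grows roughly like 1.8**t.
unchecked = followon_trace([2.0] * steps, [0.9] * steps, ones)

# Same ratios, but with termination (gamma_t = 0) at step 10, which resets F_t.
gammas_with_reset = [0.9] * steps
gammas_with_reset[10] = 0.0
with_reset = followon_trace([2.0] * steps, gammas_with_reset, ones)

# Same ratios, with F_t bounded above by 5.
bounded = followon_trace([2.0] * steps, [0.9] * steps, ones, bound=5.0)

print(unchecked[-1], with_reset[10], max(bounded))
```

In this sketch the bounded trace never exceeds the chosen limit, but it no longer equals the quantity that the stability argument relies on, which is the bias referred to above.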

9. Conclusions and Future Work 

We have introduced a way of varying the emphasis or strength of the updates of TD learning 
algorithms from step to step, based on importance sampling, that should result in much 
lower variance than previous methods (Precup et al. 2001). In particular, we have introduced 
the emphatic TD(λ) algorithm and shown that it solves the problem of instability that 
plagues conventional TD(λ) when applied in off-policy training situations in conjunction 
with linear function approximation. Compared to gradient-TD methods, emphatic TD(λ) 
is simpler in that it has a single parameter vector and a single step size rather than two 
of each. The per-time-step complexities of gradient-TD and emphatic-TD methods are 
both linear in the number of parameters; both are much simpler than quadratic-complexity 
methods such as LSTD(λ) and its off-policy variants. We have also presented a few empirical 
examples of emphatic TD(0) compared to conventional TD(0) adapted to off-policy training. 




These examples illustrate some of emphatic TD(λ)’s basic strengths and weaknesses, but 
a proper empirical comparison with other methods remains for future work. Extensions of 
the emphasis idea to action-value and control methods such as Sarsa(λ) and Q(λ), to true-online 
forms (van Seijen & Sutton 2014), and to weighted importance sampling (Mahmood 
et al. 2014, in preparation) are also natural and remain for future work. Yu (in preparation) 
has recently extended the emphatic idea to a least-squares algorithm and proved it and our 
emphatic TD(λ) convergent with probability one. 
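
The claim of a single parameter vector and a single step size can be illustrated with a minimal sketch of one emphatic TD(λ) step under linear function approximation. It assumes the update equations take the form F_t = ρ_{t-1} γ_t F_{t-1} + i(S_t), M_t = λ_t i(S_t) + (1 − λ_t) F_t, e_t = ρ_t (γ_t λ_t e_{t-1} + M_t x_t), and θ_{t+1} = θ_t + α δ_t e_t, as defined earlier in the paper. The class name `EmphaticTD` and all argument names are hypothetical, and the sketch is not a tuned implementation.

```python
class EmphaticTD:
    """Linear emphatic TD(lambda) with one weight vector and one step size."""

    def __init__(self, num_features, alpha):
        self.alpha = alpha                   # the single step size
        self.theta = [0.0] * num_features    # the single learned parameter vector
        self.e = [0.0] * num_features        # eligibility trace e_{t-1}
        self.f = 0.0                         # followon trace F_{t-1}
        self.rho_prev = 1.0                  # rho_{t-1}; irrelevant while F_{t-1} = 0

    def step(self, x, reward, x_next, rho, gamma, gamma_next, lam, interest):
        """Process one transition from features x to x_next; return the TD error.

        rho, gamma, lam, and interest are the values for the current step t;
        gamma_next is the discount associated with the next state.
        """
        self.f = self.rho_prev * gamma * self.f + interest    # F_t
        m = lam * interest + (1.0 - lam) * self.f             # emphasis M_t
        self.e = [rho * (gamma * lam * e_i + m * x_i)         # e_t
                  for e_i, x_i in zip(self.e, x)]
        v = sum(t_i * x_i for t_i, x_i in zip(self.theta, x))
        v_next = sum(t_i * x_i for t_i, x_i in zip(self.theta, x_next))
        delta = reward + gamma_next * v_next - v              # TD error
        self.theta = [t_i + self.alpha * delta * e_i
                      for t_i, e_i in zip(self.theta, self.e)]
        self.rho_prev = rho
        return delta
```

Apart from the weight vector, the only state carried between steps in this sketch is the eligibility trace, the scalar followon trace, and the previous importance sampling ratio. A gradient-TD method would additionally carry a second learned weight vector with its own step size.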

Two additional ideas for future work deserve special mention. 

First, note that the present work has focused on ways of ensuring that the key matrix 
is positive definite, which implies positive definiteness of the A matrix and thus that the 
update is stable. An alternative strategy would be to work directly with the A matrix. 
Recall that the A matrix is vastly smaller than the key matrix; it has a row and column 
for each feature, whereas the key matrix has a row and column for each state. It might be 
feasible then to keep statistics for each row and column of A, whereas of course it would not 
be for the large key matrix. For example, one might try to use such statistics to directly 
test for diagonal dominance (and thus positive definiteness) of A. If it were possible to 
adjust some of the free parameters (e.g., the λ or i functions) to ensure positive definiteness 
while reducing the variance of F_t, then a substantially improved algorithm might be found. 
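
A direct test of this kind might look like the following sketch. It assumes that an estimate of the A matrix is already available as a list of rows; how such statistics would be collected online is left open here, as it is in the text. The check uses the standard fact that a real matrix A satisfies y^T A y > 0 for all nonzero y exactly when its symmetric part A + A^T is positive definite, together with the sufficient condition that a symmetric matrix with positive diagonal and strict diagonal dominance is positive definite. The function name and the example matrices are illustrative assumptions.

```python
def passes_diagonal_dominance_test(a):
    """Sufficient (not necessary) test that a square matrix is positive definite.

    Forms S = A + A^T and checks that every diagonal entry of S is positive
    and strictly larger than the sum of absolute off-diagonal entries in its row.
    """
    n = len(a)
    s = [[a[j][k] + a[k][j] for k in range(n)] for j in range(n)]
    for j in range(n):
        off_diagonal = sum(abs(s[j][k]) for k in range(n) if k != j)
        if s[j][j] <= 0 or s[j][j] <= off_diagonal:
            return False
    return True


# Passes: S = [[4, -1], [-1, 3]] is strictly diagonally dominant.
print(passes_diagonal_dominance_test([[2.0, -1.0], [0.0, 1.5]]))

# Fails, and the matrix is indeed not positive definite (try y = (1, 1)).
print(passes_diagonal_dominance_test([[1.0, -3.0], [0.0, 1.0]]))

# Fails even though the matrix is positive definite: the test is only sufficient.
print(passes_diagonal_dominance_test([[1.0, 2.0], [2.0, 5.0]]))
```

Because the condition is only sufficient, a failed test does not by itself show that the update is unstable; it indicates only that this particular certificate is unavailable.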

The second idea for future work is that the emphasis algorithm, by tracing the dependencies 
among the estimates at various states, is doing something clever that ought to show 
up as improved bounds on the asymptotic approximation error. The bound given by Tsitsiklis 
and Van Roy (1997) probably cannot be significantly improved if λ, γ, i, and ρ are 
all constant, because in this case emphasis asymptotes to a constant that can be absorbed 
into the step size. But if any of these vary from step to step, then emphatic TD(λ) is 
genuinely different and may improve over conventional TD(λ). In particular, consider an 
episodic on-policy case where i(s) = 1 and λ(s) = 0, for all s ∈ S, and γ(s) = 1 for all states 
except for a terminal state where it is zero (and from which a new episode starts). In this 
case emphasis would increase linearly within an episode to a maximum on the final state, 
whereas conventional TD(λ) would give equal weight to all steps within the episode. If 
the feature representation were insufficient to represent the value function exactly, then the 
emphatic algorithm might improve over the conventional algorithm in terms of asymptotic 
MSVE (15). Similarly, improvements in asymptotic MSVE over conventional algorithms 
might be possible whenever i varies from state to state, such as in the common episodic case 
in which we are interested only in accurately valuing the start state of the episode, and yet 
we choose λ < 1 to reduce variance. There may be a wide range of interesting theoretical 
and empirical work to be done along these lines. 
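
The two regimes contrasted above can be checked numerically with a short sketch. It assumes the emphasis is computed as M_t = λ i + (1 − λ) F_t with F_t = ρ γ F_{t-1} + i, specialized to constant λ, γ, i, and ρ within the span being simulated. The function name `emphasis_sequence`, the episode length of 5, and the constants 0.9 and 0.5 are illustrative assumptions.

```python
def emphasis_sequence(num_steps, lam, gamma, interest=1.0, rho=1.0):
    """Return [M_0, ..., M_{num_steps-1}] for constant lam, gamma, interest, rho.

    Uses F_t = rho * gamma * F_{t-1} + interest (with F_{-1} = 0) and
    M_t = lam * interest + (1 - lam) * F_t.
    """
    emphases = []
    f = 0.0
    for _ in range(num_steps):
        f = rho * gamma * f + interest
        emphases.append(lam * interest + (1.0 - lam) * f)
    return emphases


# Episodic on-policy case: i = 1, lambda = 0, gamma = 1 within the episode.
# The emphasis grows linearly: 1, 2, 3, 4, 5.
within_episode = emphasis_sequence(5, lam=0.0, gamma=1.0)

# Constant discounting (gamma = 0.9) with lambda = 0.5: the emphasis levels off
# near lam * i + (1 - lam) * i / (1 - gamma) = 5.5.
constant_case = emphasis_sequence(500, lam=0.5, gamma=0.9)

print(within_episode, constant_case[-1])
```

In the constant case the limiting emphasis is a fixed multiplier on every update, which is why it can be folded into the step size. In the episodic case the multiplier differs from step to step, so the weighting of states genuinely changes.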


Acknowledgements 

The authors thank Hado van Hasselt, Doina Precup, Huizhen Yu, and Brendan Bennett 
for insights and discussions contributing to the results presented in this paper, and the 
entire Reinforcement Learning and Artificial Intelligence research group for providing the 
environment to nurture and support this research. We gratefully acknowledge funding from 
Alberta Innovates - Technology Futures and from the Natural Sciences and Engineering 
Research Council of Canada. 




References 

Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. 
In Proceedings of the 12th International Conference on Machine Learning, 
pp. 30-37. Morgan Kaufmann, San Francisco. Important modifications and errata added 
to the online version on November 22, 1995. 

Bertsekas, D. P. (2012). Dynamic Programming and Optimal Control: Approximate Dynamic 
Programming, Fourth Edition. Athena Scientific, Belmont, MA. 

Bertsekas, D. P., Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific, 
Belmont, MA. 

Boyan, J. A. (1999). Least-squares temporal difference learning. In Proceedings of the 16th 
International Conference on Machine Learning, pp. 49-56. 

Bradtke, S., Barto, A. G. (1996). Linear least-squares algorithms for temporal difference 
learning. Machine Learning 22: 33-57. 

Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning 8:341-362. 

Dann, C., Neumann, G., Peters, J. (2014). Policy evaluation with temporal differences: A 
survey and comparison. Journal of Machine Learning Research 15:809-883. 

Geist, M., Scherrer, B. (2014). Off-policy learning with eligibility traces: A survey. Journal 
of Machine Learning Research 15: 289-333. 

Gordon, G. J. (1995). Stable function approximation in dynamic programming. In 
A. Prieditis and S. Russell (eds.), Proceedings of the 12th International Conference on 
Machine Learning, pp. 261-268. Morgan Kaufmann, San Francisco. An expanded version 
was published as Technical Report CMU-CS-95-103. Carnegie Mellon University, 
Pittsburgh, PA, 1995. 

Gordon, G. J. (1996). Stable fitted reinforcement learning. In D. S. Touretzky, M. C. Mozer, 
M. E. Hasselmo (eds.), Advances in Neural Information Processing Systems: Proceedings 
of the 1995 Conference, pp. 1052-1058. MIT Press, Cambridge, MA. 

Hackman, L. (2012). Faster Gradient-TD Algorithms. MSc thesis, University of Alberta. 

Klopf, A. H. (1988). A neuronal model of classical conditioning. Psychobiology 16(2):85-125. 

Kolter, J. Z. (2011). The fixed points of off-policy TD. In Advances in Neural Information 
Processing Systems 24, pp. 2169-2177. 

Lagoudakis, M., Parr, R. (2003). Least squares policy iteration. Journal of Machine Learning 
Research 4:1107-1149. 

Ludvig, E. A., Sutton, R. S., Kehoe, E. J. (2012). Evaluating the TD model of classical 
conditioning. Learning & Behavior 40(3):305-319. 

Maei, H. R. (2011). Gradient Temporal-Difference Learning Algorithms. PhD thesis, University 
of Alberta. 




Maei, H. R., Sutton, R. S. (2010). GQ(λ): A general gradient algorithm for temporal-difference 
prediction learning with eligibility traces. In Proceedings of the Third Conference 
on Artificial General Intelligence, pp. 91-96. Atlantis Press. 

Maei, H. R., Szepesvari, Cs., Bhatnagar, S., Sutton, R. S. (2010). Toward off-policy learning 
control with function approximation. In Proceedings of the 27th International Conference 
on Machine Learning, pp. 719-726. 

Mahmood, A. R., van Hasselt, H., Sutton, R. S. (2014). Weighted importance sampling for 
off-policy learning with linear function approximation. Advances in Neural Information 
Processing Systems 27. 

Mahmood, A. R., Sutton, R. S. (in preparation). Off-policy learning based on weighted 
importance sampling with linear computational complexity. 

Modayil, J., White, A., Sutton, R. S. (2014). Multi-timescale nexting in a reinforcement 
learning robot. Adaptive Behavior 22(2):146-160. 

Nedic, A., Bertsekas, D. P. (2003). Least squares policy evaluation algorithms with linear 
function approximation. Discrete Event Dynamic Systems 13(1-2):79-110. 

Niv, Y., Schoenbaum, G. (2008). Dialogues on prediction errors. Trends in Cognitive Sciences 
12(7):265-272. 

O’Doherty, J. P. (2012). Beyond simple reinforcement learning: The computational neurobiology 
of reward learning and valuation. European Journal of Neuroscience 35(7):987-990. 

Precup, D., Sutton, R. S., Dasgupta, S. (2001). Off-policy temporal-difference learning with 
function approximation. In Proceedings of the 18th International Conference on Machine 
Learning, pp. 417-424. 

Precup, D., Sutton, R. S., Singh, S. (2000). Eligibility traces for off-policy policy evaluation. 
In Proceedings of the 17th International Conference on Machine Learning, pp. 759-766. 
Morgan Kaufmann. 

Rummery, G. A. (1995). Problem Solving with Reinforcement Learning. PhD thesis, University 
of Cambridge. 

Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM 
Journal on Research and Development 3: 210-229. Reprinted in E. A. Feigenbaum, & 
J. Feldman (Eds.), Computers and thought. New York: McGraw-Hill. 

Schultz, W., Dayan, P., Montague, P. R. (1997). A neural substrate of prediction and 
reward. Science 275(5306):1593-1599. 

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine 
Learning 3:9-44, erratum p. 377. 

Sutton, R. S. (1995). TD models: Modeling the world at a mixture of time scales. In 
Proceedings of the 12th International Conference on Machine Learning, pp. 531-539. 
Morgan Kaufmann. 




Sutton, R. S. (2009). The grand challenge of predictive empirical abstract knowledge. 
Working Notes of the IJCAI-09 Workshop on Grand Challenges for Reasoning from Experiences. 

Sutton, R. S. (2012). Beyond reward: The problem of knowledge and data. In Proceedings 
of the 21st International Conference on Inductive Logic Programming, S. H. Muggleton, 
A. Tamaddoni-Nezhad, F. A. Lisi (Eds.): ILP 2011, LNAI 7207, pp. 2-6. Springer, 
Heidelberg. 

Sutton, R. S., Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement. In 
M. Gabriel and J. Moore (Eds.), Learning and Computational Neuroscience: Foundations 
of Adaptive Networks, pp. 497-537. MIT Press, Cambridge, MA. 

Sutton, R. S., Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press. 

Sutton, R. S., Mahmood, A. R., Precup, D., van Hasselt, H. (2014). A new Q(λ) with interim 
forward view and Monte Carlo equivalence. In Proceedings of the 31st International 
Conference on Machine Learning. JMLR W&CP 32(2). 

Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, Cs., Wiewiora, 
E. (2009). Fast gradient-descent methods for temporal-difference learning with linear 
function approximation. In Proceedings of the 26th International Conference on Machine 
Learning, pp. 993-1000, ACM. 

Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., Precup, D. 
(2011). Horde: A scalable real-time architecture for learning knowledge from unsupervised 
sensorimotor interaction. In Proceedings of the 10th International Conference on 
Autonomous Agents and Multiagent Systems, pp. 761-768. 

Sutton, R. S., Precup, D., Singh, S. (1999). Between MDPs and semi-MDPs: A framework 
for temporal abstraction in reinforcement learning. Artificial Intelligence 112:181-211. 

Sutton, R. S., Rafols, E. J., Koop, A. (2006). Temporal abstraction in temporal-difference 
networks. Advances in Neural Information Processing Systems 18. MIT Press. 

Tesauro, G. (1992). Practical issues in temporal difference learning. Machine Learning 
8: 257-277. 

Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of 
the ACM 38(3):58-68. 

Thomas, P. (2014). Bias in natural actor-critic algorithms. In Proceedings of the 31st 
International Conference on Machine Learning. JMLR W&CP 32(1):441-448. 

Tsitsiklis, J. N., Van Roy, B. (1996). Feature-based methods for large scale dynamic 
programming. Machine Learning, 22:59-94. 

Tsitsiklis, J. N., Van Roy, B. (1997). An analysis of temporal-difference learning with 
function approximation. IEEE Transactions on Automatic Control 42:674-690. 

van Seijen, H., Sutton, R. S. (2014). True online TD(λ). In Proceedings of the 31st International 
Conference on Machine Learning. JMLR W&CP 32(1):692-700. 




Varga, R. S. (1962). Matrix Iterative Analysis. Englewood Cliffs, NJ: Prentice-Hall. 

Wang, M., Bertsekas, D. P. (2013). Stabilization of stochastic iterative methods for singular 
and nearly singular linear systems. Mathematics of Operations Research 39(1):1-30. 

Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, University of 
Cambridge. 

Watkins, C. J. C. H., Dayan, P. (1992). Q-learning. Machine Learning 8: 279-292. 

White, A. (in preparation). Developing a Predictive Approach to Knowledge. PhD thesis, 
University of Alberta. 

Yu, H. (2010). Convergence of least squares temporal difference methods under general 
conditions. In Proceedings of the 27th International Conference on Machine Learning, 
pp. 1207-1214. 

Yu, H. (2012). Least squares temporal difference methods: An analysis under general 
conditions. SIAM Journal on Control and Optimization 50(6), 3310-3343. 

Yu, H. (in preparation). On convergence of emphatic temporal-difference learning. 





